SparkR Reading Tables from Hive
Hi there,

I'm testing out the new SparkR-Hive interop right now. I'm noticing an apparent disconnect between the Hive store where my data is loaded and the store that sparkRHive.init() connects to. For example, in beeline:

0: jdbc:hive2://quickstart.cloudera:1 show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+

0: jdbc:hive2://quickstart.cloudera:1 show tables;
+-----------+
| tab_name  |
+-----------+
| my_table  |
+-----------+

But in sparkR:

hqlContext <- sparkRHive.init(sc)
showDF(sql(hqlContext, "show databases"))
+----------+
| result   |
+----------+
| default  |
+----------+

showDF(tables(hqlContext, "default"))
+------------+--------------+
| tableName  | isTemporary  |
+------------+--------------+
+------------+--------------+

showDF(sql(hqlContext, "show tables"))
+------------+--------------+
| tableName  | isTemporary  |
+------------+--------------+
+------------+--------------+

The data in my_table was landed into Hive from a CSV via kite-dataset. The installation of Spark I'm working with was built separately and runs standalone. Could it be that sparkRHive.init() is getting the wrong address for the Hive metastore? How could I peer into the context to see what the address is set to, and, if it's wrong, reset it?

Ultimately, I'd like to read my_table from Hive into a SparkR DataFrame, which ought to be possible with

result <- sql(hqlContext, "SELECT * FROM my_table")

But this fails with:

org.apache.spark.sql.AnalysisException: no such table my_table; line 1 pos 14

which is expected, I suppose, since we don't see the table in the listing above. Any thoughts?

Thanks,
Alek Eskilson
Re: SparkR Reading Tables from Hive
Resolved, my hive-site.xml wasn't in the conf folder. I can load tables into DataFrames as expected.

Thanks,
Alek

From: Eskilson, Aleksander alek.eskil...@cerner.com
Date: Monday, June 8, 2015 at 3:38 PM
To: dev@spark.apache.org
Subject: SparkR Reading Tables from Hive
Re: SparkR Reading Tables from Hive
Thanks for the confirmation - I was just going to send a pointer to the documentation that talks about hive-site.xml:
http://people.apache.org/~pwendell/spark-releases/latest/sql-programming-guide.html#hive-tables

Thanks
Shivaram

On Mon, Jun 8, 2015 at 1:57 PM, Eskilson, Aleksander alek.eskil...@cerner.com wrote:

Resolved, my hive-site.xml wasn't in the conf folder. I can load tables into DataFrames as expected.

Thanks,
Alek
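A minimal sketch of the equivalent check from spark-shell, assuming a Hive-enabled Spark build with hive-site.xml now in $SPARK_HOME/conf; sparkRHive.init() is backed by a HiveContext on the JVM side, and my_table is the table name from the original post:

import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext provided by spark-shell; hive-site.xml in $SPARK_HOME/conf
// supplies the metastore location, so no extra configuration is needed here.
val hiveContext = new HiveContext(sc)

hiveContext.sql("show databases").show()   // should list the same databases as beeline
hiveContext.tables("default").show()       // my_table should now appear here
hiveContext.sql("SELECT * FROM my_table").show()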
Fwd: pull requests no longer closing by commit messages with closes #xxxx
FYI.

---------- Forwarded message ----------
From: John Greet (GitHub Staff) supp...@github.com
Date: Mon, Jun 8, 2015 at 5:50 PM
Subject: Re: pull requests no longer closing by commit messages with closes #
To: Reynold Xin r...@databricks.com

Hi Reynold,

The problem here is that the commits closing those pull requests were fetched by our mirroring process, which doesn't have permission to close issues, instead of being pushed by a user in the apache GitHub organization. Usually the repository receives regular, I assume automated, pushes to its master branch, but there was a gap in those pushes between 2 PM PDT on June 5th and 1:16 PM PDT on June 7th. This happened at least once before, back in November. Now that those pushes have resumed, pull requests are closing normally once again.

Let us know if you have any other questions.

Cheers,
John

I'm a committer on Apache Spark (the most active open source project in the data space). We use GitHub as the primary way to accept contributions. We use a custom merge script to merge pull requests rather than GitHub's merge button in order to preserve a linear commit history. Part of the merge script relies on the closes # feature to close the corresponding pull requests. I noticed recently that pull requests are no longer automatically closed, even if the commits are merged with the message closes #. Here are two recent examples:

https://github.com/apache/spark/pull/6670
https://github.com/apache/spark/pull/6689

Can you take a look at what's going on? Thanks.
Re: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1

On Mon, Jun 8, 2015 at 17:51 Wang, Daoyuan daoyuan.w...@intel.com wrote:

+1

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Wednesday, June 03, 2015 1:47 PM
To: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

Hey all - a tiny nit from the last e-mail. The tag is v1.4.0-rc4. The exact commit and all other information is correct. (Thanks to Shivaram, who pointed this out.)
RE: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Wednesday, June 03, 2015 1:47 PM
To: dev@spark.apache.org
Subject: Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

Hey all - a tiny nit from the last e-mail. The tag is v1.4.0-rc4. The exact commit and all other information is correct. (Thanks to Shivaram, who pointed this out.)
Re: [VOTE] Release Apache Spark 1.4.0 (RC4)
+1

Built for Hadoop 2.4. Ran a few jobs on YARN and tested spark.sql.unsafe, whose performance seems great!
[SparkSQL] What is Exchange in physical plan for?
Hi,

DataFrame.explain() shows the physical plan of a query. I noticed there are a lot of `Exchange`s in it, like below:

Project [period#20L,categoryName#0,regionName#10,action#15,list_id#16L]
 ShuffledHashJoin [region#18], [regionCode#9], BuildRight
  Exchange (HashPartitioning [region#18], 12)
   Project [categoryName#0,list_id#16L,period#20L,action#15,region#18]
    ShuffledHashJoin [refCategoryID#3], [category#17], BuildRight
     Exchange (HashPartitioning [refCategoryID#3], 12)
      Project [categoryName#0,refCategoryID#3]
       PhysicalRDD [categoryName#0,familyName#1,parentRefCategoryID#2,refCategoryID#3], MapPartitionsRDD[5] at mapPartitions at SQLContext.scala:439
     Exchange (HashPartitioning [category#17], 12)
      Project [timestamp_sec#13L AS period#20L,category#17,region#18,action#15,list_id#16L]
       PhysicalRDD [syslog#12,timestamp_sec#13L,timestamp_usec#14,action#15,list_id#16L,category#17,region#18,expiration_time#19], MapPartitionsRDD[16] at map at SQLContext.scala:394
  Exchange (HashPartitioning [regionCode#9], 12)
   Project [regionName#10,regionCode#9]
    PhysicalRDD [cityName#4,countryCode#5,countryName#6,dptCode#7,dptName#8,regionCode#9,regionName#10,zipCode#11], MapPartitionsRDD[11] at mapPartitions at SQLContext.scala:439

I also found its class:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala

So what does it mean? Thank you.

Hao.
RE: [SparkSQL] What is Exchange in physical plan for?
It means the data shuffling, and its arguments also show the partitioning strategy.

-----Original Message-----
From: invkrh [mailto:inv...@gmail.com]
Sent: Monday, June 8, 2015 9:34 PM
To: dev@spark.apache.org
Subject: [SparkSQL] What is Exchange in physical plan for?
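A minimal spark-shell sketch of where those Exchange operators come from; the data and column names here are illustrative, not taken from the plan above. Joining two DataFrames on a key forces rows with equal keys onto the same partitions, and explain() marks those shuffles as Exchange nodes:

import sqlContext.implicits._   // sqlContext is provided by spark-shell

// Two small DataFrames keyed on "k" (hypothetical data for illustration).
val left  = sc.parallelize(1 to 100).map(i => (i % 10, i)).toDF("k", "v")
val right = sc.parallelize(1 to 10).map(i => (i, "name" + i)).toDF("k", "name")

// The join has to co-locate rows with equal keys, so both sides are
// repartitioned (shuffled) before the join operator runs.
val joined = left.join(right, left("k") === right("k"))
joined.explain()   // prints the physical plan

Whether a given join shuffles or broadcasts depends on the planner and on table statistics, but for plain parallelized data like this the printed plan typically contains Exchange (HashPartitioning ...) nodes under the join, matching the plan shown in the question.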
Re: [VOTE] Release Apache Spark 1.4.0 (RC4)
Hi All,

Thanks for the continued voting! I'm going to leave this thread open for another few days to continue to collect feedback.

- Patrick

On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.4.0!

The tag to be voted on is v1.4.0-rc3 (commit 22596c5):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=22596c534a38cfdda91aef18aa9037ab101e4251

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
[published as version: 1.4.0]
https://repository.apache.org/content/repositories/orgapachespark-/
[published as version: 1.4.0-rc4]
https://repository.apache.org/content/repositories/orgapachespark-1112/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc4-docs/

Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Saturday, June 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.4.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== What has changed since RC3 ==
In addition to many smaller fixes, three blocker issues were fixed:
4940630 [SPARK-8020] [SQL] Spark SQL conf in spark-defaults.conf make metadataHive get constructed too early
6b0f615 [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
78a6723 [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton

== How can I help test this release? ==
If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload, running it on this release candidate, and reporting any regressions.

== What justifies a -1 vote for this release? ==
This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release.
[ml] Why all model classes are final?
Hi,

Previously all the models in the ml package were package-private, so if I needed to customize a model I could inherit from it by placing my class in the org.apache.spark.ml package in my own project. But now new models (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala#L46) are final classes. So if I need to customize one line or so, I need to redefine the whole class. Are there any reasons for doing this? As a developer, I understand all the risks of using Developer/Alpha APIs. That's why I'm using Spark: it provides building blocks that I can easily customize and combine for my needs.

Thanks,
Peter Rudenko
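For illustration, a minimal sketch of what remains available once subclassing is off the table: configuring the final class through its params and composing it into a Pipeline. Column names and parameter values below are purely hypothetical:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier

// GBTClassifier cannot be extended, so its behavior can only be tuned through
// the params it exposes and by combining it with other stages in a Pipeline.
val gbt = new GBTClassifier()
  .setLabelCol("label")          // hypothetical column names
  .setFeaturesCol("features")
  .setMaxIter(50)
  .setMaxDepth(4)

val pipeline = new Pipeline().setStages(Array(gbt))
// val model = pipeline.fit(trainingDF)   // trainingDF: a DataFrame with the columns above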
Re: Stages with non-arithmetic numbering & Timing metrics in event logs
Hi Mike,

All good questions, let me take a stab at answering them:

1. Event Logs + Stages:

It's normal for stages to get skipped if they are shuffle map stages whose output gets read multiple times. E.g., here's a little example program I wrote earlier to demonstrate this: d3 doesn't need to be re-shuffled since each time it's read with the same partitioner. So skipping stages in this way is a good thing:

val partitioner = new org.apache.spark.HashPartitioner(10)
val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(partitioner)
(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }.partitionBy(partitioner)
  println(idx + " --- " + otherData.join(d3).count())
}

If you run this and look in the UI, you'd see that all jobs except for the first one have one stage that is skipped. You will also see this in the log:

15/06/08 10:52:37 INFO DAGScheduler: Parents of final stage: List(Stage 12, Stage 13)
15/06/08 10:52:37 INFO DAGScheduler: Missing parents: List(Stage 13)

Admittedly that is not very clear, but it is sort of indicating to you that the DAGScheduler first created stage 12 as a necessary step, and then later on changed its mind by realizing that everything it needed for stage 12 already existed, so there was nothing to do.

2. Extracting Event Log Information

Maybe you are interested in SparkListener? Though unfortunately, I don't know of a good blog post describing it; hopefully the docs are clear ...

3. Time Metrics in Spark Event Log

This is a great question. I *think* the only exception is that t_gc is really overlapped with t_exec. So I think you should really expect

(t2 - t1) >= (t_ser + t_deser + t_exec)

I am not 100% sure about this, though. I'd be curious if that constraint was ever violated.

As for your question on shuffle read vs. shuffle write time -- I wouldn't necessarily expect the same stage to have times for both shuffle read and shuffle write -- in the simplest case, you'll have shuffle write times in one, and shuffle read times in the next one. But even taking that into account, there is a difference in the way they are measured. Shuffle read operations are pipelined, and the way we measure shuffle read, it's just how much time is spent *waiting* for network transfer. It could be that there is no (measurable) wait time b/c the next blocks are fetched before they are needed. Shuffle writes occur in the normal task execution thread, though, so we (try to) measure all of it.

On Sun, Jun 7, 2015 at 11:12 PM, Mike Hynes 91m...@gmail.com wrote:

Hi Patrick and Akhil,

Thank you both for your responses. This is a bit of an extended email, but I'd like to:
1. Answer your (Patrick) note about the missing stages, since the IDs do (briefly) appear in the event logs
2. Ask for advice/experience with extracting information from the event logs in a columnar, delimiter-separated format
3. Ask about the time metrics reported in the event logs; currently, the elapsed time for a task does not equal the sum of the times for its components

1. Event Logs + Stages:
=
As I said before, in the spark logs (the log4j configurable ones from the driver), I only see references to some stages, where the stage IDs are not arithmetically increasing. In the event logs, however, I will see reference to *every* stage, although not all stages will have tasks associated with them.
For instance, to examine the actual stages that have tasks, you can see missing stages:

# grep -E 'Event:SparkListenerTaskEnd' app.log \
#   | grep -Eo 'Stage ID:[[:digit:]]+' \
#   | sort -n | uniq | head -n 5
Stage ID:0
Stage ID:1
Stage ID:10
Stage ID:11
Stage ID:110

However, these missing stages *do* appear in the event logs as Stage IDs in the jobs submitted, i.e., for:

# grep -E 'Event:SparkListenerJobStart' app.log | grep -Eo 'Stage IDs:\[.*\]' | head -n 5
Stage IDs:[0,1,2]
Stage IDs:[5,3,4]
Stage IDs:[6,7,8]
Stage IDs:[9,10,11]
Stage IDs:[12,13,14]

I do not know if this amounts to a bug, since I am not familiar with the scheduler in detail. The stages have seemingly been created somewhere in the DAG, but then have no associated tasks and never appear again.

2. Extracting Event Log Information

Currently we are running scalability tests, and are finding very poor scalability for certain block matrix algorithms. I would like to have finer detail about the communication time and bandwidth when data is transferred between nodes. I would really just like to have a file with nothing but task info in a format such as:

timestamp (ms), task ID, hostname, execution time (ms), GC time (ms), ...
0010294, 1, slave-1, 503, 34, ...
0010392, 2, slave-2, 543, 32, ...

and similarly for jobs/stages/rdd_memory/shuffle output/etc. I have extracted the relevant time fields from the spark event logs with a sed script, but I wonder if
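On the SparkListener suggestion above, a minimal, hypothetical sketch (not from the thread) of a listener that emits per-task lines close to the CSV layout described in point 2, using the TaskInfo/TaskMetrics fields available in Spark 1.4; writing to a file instead of println is a straightforward extension:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Collects per-task metrics as CSV-style lines instead of parsing the JSON event log.
class CsvTaskListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    val metrics = taskEnd.taskMetrics   // may be null for failed tasks
    if (info != null && metrics != null) {
      println(Seq(
        info.finishTime,            // timestamp (ms)
        info.taskId,
        info.host,                  // hostname
        info.duration,              // elapsed task time (ms)
        metrics.executorRunTime,    // execution time (ms)
        metrics.jvmGCTime           // GC time (ms)
      ).mkString(","))
    }
  }
}

// Register before running the job, e.g. from spark-shell:
sc.addSparkListener(new CsvTaskListener)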
[sample code] deeplearning4j for Spark ML (@DeveloperAPI)
The deeplearning4j framework provides a variety of distributed, neural network-based learning algorithms, including convolutional nets, deep auto-encoders, deep-belief nets, and recurrent nets. We're working on integration with the Spark ML pipeline, leveraging the developer API. This announcement is to share some code and get feedback from the Spark community. The integration code is located in the dl4j-spark-ml module in the deeplearning4j repository.

Major aspects of the integration work:

- ML algorithms. To bind the dl4j algorithms to the ML pipeline, we developed a new classifier and a new unsupervised learning estimator.
- ML attributes. We strove to interoperate well with other pipeline components. ML Attributes are column-level metadata enabling information sharing between pipeline components. See here how the classifier reads label metadata from a column provided by the new StringIndexer.
- Large binary data. It is challenging to work with large binary data in Spark. An effective approach is to leverage PrunedScan and to carefully control partition sizes. Here we explored this with a custom data source based on the new relation API.
- Column-based record readers. Here we explored how to construct rows from a Hadoop input split by composing a number of column-level readers, with pruning support.
- UDTs. With Spark SQL it is possible to introduce new data types. We prototyped an experimental Tensor type, here.
- Spark Package. We developed a Spark package to make it easy to use the dl4j framework in spark-shell and with spark-submit. See the deeplearning4j/dl4j-spark-ml repository for useful snippets involving the sbt-spark-package plugin.
- Example code. Examples demonstrate how the standardized ML API simplifies interoperability, such as with label preprocessing and feature scaling. See the deeplearning4j/dl4j-spark-ml-examples repository for an expanding set of example pipelines.

Hope this proves useful to the community as we transition to exciting new concepts in Spark SQL and Spark ML. Meanwhile, we have Spark working with multiple GPUs on AWS, and we're looking forward to optimizations that will speed neural net training even more.

Eron Wright
Contributor | deeplearning4j.org