[jira] [Updated] (SPARK-1996) Remove use of special Maven repo for Akka
[ https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1996:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Remove use of special Maven repo for Akka
-----------------------------------------

Key: SPARK-1996
URL: https://issues.apache.org/jira/browse/SPARK-1996
Project: Spark
Issue Type: Improvement
Components: Documentation, Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
Fix For: 1.1.0

According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html, Akka is now published to Maven Central, so our documentation and POM files don't need to use the old Akka repo. It will be one less step for users to worry about.
[jira] [Updated] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info
[ https://issues.apache.org/jira/browse/SPARK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1827:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

LICENSE and NOTICE files need a refresh to contain transitive dependency info
------------------------------------------------------------------------------

Key: SPARK-1827
URL: https://issues.apache.org/jira/browse/SPARK-1827
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
Fix For: 1.0.0

(Pardon marking it a blocker, but I think it needs doing before 1.0, per chat with [~pwendell].)

The LICENSE and NOTICE files need to cover all transitive dependencies, since these are all distributed in the assembly jar (c.f. http://www.apache.org/dev/licensing-howto.html ). I don't believe the current files cover everything. It's possible to mostly-automatically generate these. I will generate this and propose a patch to both today.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278453#comment-14278453 ]

Guoqiang Li commented on SPARK-1405:
------------------------------------

We can use the demo scripts in word2vec to get the same corpus:

{code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e 's/“/"/g' -e 's/”/"/g' \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}

wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}

parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
-----------------------------------------------------------------

Key: SPARK-1405
URL: https://issues.apache.org/jira/browse/SPARK-1405
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
Labels: features
Attachments: performance_comparison.png
Original Estimate: 336h
Remaining Estimate: 336h

Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which rely on optimization methods such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (already solved), a word segmentation (imported from Lucene), and a Gibbs sampling core.
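For reference, the sampling step at the heart of this issue is small. Below is a minimal single-machine sketch of the collapsed Gibbs update for one token, assuming dense count arrays; all names are illustrative, and the distributed implementation in the actual PR differs.

{code}
import scala.util.Random

object GibbsSketch {
  // One collapsed Gibbs step for one token: p(z=k | rest) is proportional to
  // (n(d,k) + alpha) * (n(w,k) + beta) / (n(k) + V*beta).
  def sampleTopic(
      word: Int, doc: Int, oldTopic: Int,
      docTopicCounts: Array[Array[Int]],   // n(d, k)
      wordTopicCounts: Array[Array[Int]],  // n(w, k)
      topicCounts: Array[Int],             // n(k)
      vocabSize: Int, alpha: Double, beta: Double,
      rng: Random): Int = {
    val numTopics = topicCounts.length
    // Remove the token's current assignment from all counts.
    docTopicCounts(doc)(oldTopic) -= 1
    wordTopicCounts(word)(oldTopic) -= 1
    topicCounts(oldTopic) -= 1
    val p = Array.tabulate(numTopics) { k =>
      (docTopicCounts(doc)(k) + alpha) *
        (wordTopicCounts(word)(k) + beta) / (topicCounts(k) + vocabSize * beta)
    }
    // Draw the new topic proportionally to p.
    val u = rng.nextDouble() * p.sum
    var k = 0
    var cum = p(0)
    while (cum < u && k < numTopics - 1) { k += 1; cum += p(k) }
    // Add the token back under its new topic.
    docTopicCounts(doc)(k) += 1
    wordTopicCounts(word)(k) += 1
    topicCounts(k) += 1
    k
  }
}
{code}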
[jira] [Updated] (SPARK-1798) Tests should clean up temp files
[ https://issues.apache.org/jira/browse/SPARK-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1798:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Tests should clean up temp files
--------------------------------

Key: SPARK-1798
URL: https://issues.apache.org/jira/browse/SPARK-1798
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Fix For: 1.0.0

Three issues related to temp files that tests generate -- these should be touched up for hygiene but are not urgent:

1. Modules have a log4j.properties which directs the unit-test.log output file to a directory like [module]/target/unit-test.log. But this ends up creating [module]/[module]/target/unit-test.log instead of the former.

2. The work/ directory is not deleted by mvn clean, in the parent or in modules. Neither is the checkpoint/ directory created under the various external modules.

3. Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling deleteOnExit() at creation and trying to call Utils.deleteRecursively consistently to clean up, sometimes in an @After method.

(If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of LocalSparkContext, which provides management of temp directories for subclasses to take advantage of -- a rough sketch follows.)
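A rough sketch of what such a trait could look like, assuming ScalaTest; the trait and its members are hypothetical names for illustration, not an existing Spark API.

{code}
import java.io.File
import java.nio.file.Files
import org.scalatest.{BeforeAndAfterEach, Suite}

trait TempDirectories extends BeforeAndAfterEach { self: Suite =>
  private var dirs = List.empty[File]

  /** Create a temp directory that is deleted after each test. */
  def newTempDir(): File = {
    val dir = Files.createTempDirectory("spark-test").toFile
    dir.deleteOnExit() // backstop in case afterEach never runs
    dirs ::= dir
    dir
  }

  override def afterEach(): Unit = {
    try dirs.foreach(deleteRecursively)
    finally { dirs = Nil; super.afterEach() }
  }

  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) {
      Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    f.delete()
  }
}
{code}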
[jira] [Updated] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-3356:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Document when RDD elements' ordering within partitions is nondeterministic
--------------------------------------------------------------------------

Key: SPARK-3356
URL: https://issues.apache.org/jira/browse/SPARK-3356
Project: Spark
Issue Type: Documentation
Components: Documentation
Reporter: Matei Zaharia
Assignee: Sean Owen
Fix For: 1.2.0

As reported in SPARK-3098 for example, for users of zipWithIndex, zipWithUniqueId, etc. (and maybe even things like mapPartitions), it's confusing that the order of elements in each partition after a shuffle operation is nondeterministic (unless the operation was sortByKey). We should explain this in the docs for the zip and partition-wise operations. Another subtle issue is that the order of values for each key in groupBy / join / etc. can be nondeterministic -- we need to explain that too.
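To illustrate the kind of pitfall the docs should call out, here is a small spark-shell session (assuming the usual sc): the indices that zipWithIndex assigns after a repartition depend on the arbitrary order of elements within each partition, and may differ across reruns.

{code}
val rdd = sc.parallelize(1 to 1000000)

// After a shuffle, the order of elements within each partition is arbitrary...
val shuffled = rdd.repartition(10)

// ...so the index assigned to a given element may differ from run to run.
val indexed = shuffled.zipWithIndex()

// Only a sort pins the order down deterministically.
val sorted = rdd.map(x => (x, x)).sortByKey()
{code}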
[jira] [Updated] (SPARK-2955) Test code fails to compile with mvn compile without install
[ https://issues.apache.org/jira/browse/SPARK-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-2955:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Test code fails to compile with mvn compile without install
------------------------------------------------------------

Key: SPARK-2955
URL: https://issues.apache.org/jira/browse/SPARK-2955
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Labels: build, compile, scalatest, test, test-compile
Fix For: 1.2.0

(This is the corrected follow-up to https://issues.apache.org/jira/browse/SPARK-2903 )

Right now, mvn compile test-compile fails to compile Spark. (Don't worry; mvn package works, so this is not major.)

The issue stems from test code in some modules depending on test code in other modules. That is perfectly fine and supported by Maven. It takes extra work to get this to work with scalatest, and this has been attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86

This formulation is not quite enough, since the SQL Core module's tests fail to compile for lack of finding test classes in SQL Catalyst, and likewise for most Streaming integration modules depending on core Streaming test code. Example:

{code}
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
[error] class QueryTest extends PlanTest {
[error]                         ^
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
[error]   test("SPARK-1669: cacheTable should be idempotent") {
[error]   ^
...
{code}

The issue, I believe, is that generation of a test-jar is bound here to the compile phase, but the test classes are not compiled in this phase. It should bind to the test-compile phase. It works when executing mvn package or mvn install since test-jar artifacts are actually generated and available through normal Maven mechanisms as each module is built. They are then found normally, regardless of scalatest configuration.

It would be nice for a simple mvn compile test-compile to work, since the test code is perfectly compilable given the Maven declarations. On the plus side, this change is low-risk as it only affects tests. [~yhuai] made the original scalatest change and has glanced at this and thinks it makes sense.
[jira] [Updated] (SPARK-2034) KafkaInputDStream doesn't close resources and may prevent JVM shutdown
[ https://issues.apache.org/jira/browse/SPARK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-2034:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

KafkaInputDStream doesn't close resources and may prevent JVM shutdown
----------------------------------------------------------------------

Key: SPARK-2034
URL: https://issues.apache.org/jira/browse/SPARK-2034
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Fix For: 1.0.1, 1.1.0

Tobias noted today on the mailing list:

{quote}
I am trying to use Spark Streaming with Kafka, which works like a charm -- except for shutdown. When I run my program with sbt run-main, sbt will never exit, because there are two non-daemon threads left that don't die.

I created a minimal example at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-kafkadoesntshutdown-scala. It starts a StreamingContext and does nothing more than connecting to a Kafka server and printing what it receives. Using the `future { ... }` construct, I shut down the StreamingContext after some seconds and then print the difference between the threads at start time and at end time. The output can be found at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output1. There are a number of threads remaining that will prevent sbt from exiting.

When I replace `KafkaUtils.createStream(...)` with a call that does exactly the same, except that it calls `consumerConnector.shutdown()` in `KafkaReceiver.onStop()` (which it should, IMO), the output is as shown at https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output2.

Does anyone have *any* idea what is going on here and why the program doesn't shut down properly? The behavior is the same with both kafka 0.8.0 and 0.8.1.1, by the way.
{quote}

Something similar was noted last year: http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3c1380220041.2428.yahoomail...@web160804.mail.bf1.yahoo.com%3E

KafkaInputDStream doesn't close ConsumerConnector in onStop(), and does not close the Executor it creates. The latter leaves non-daemon threads and can prevent the JVM from shutting down even if streaming is closed properly.
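The Executor half of the problem is easy to reproduce outside Spark. The following self-contained sketch shows why: a thread pool built with the default thread factory uses non-daemon threads, so the JVM will not exit until shutdown() is called on it -- which is exactly what the receiver's onStop() fails to do for its executor (and, per the report, for the ConsumerConnector).

{code}
import java.util.concurrent.{Executors, TimeUnit}

object NonDaemonDemo {
  def main(args: Array[String]): Unit = {
    // Default thread factory => non-daemon worker threads.
    val pool = Executors.newFixedThreadPool(2)
    pool.submit(new Runnable { def run(): Unit = Thread.sleep(1000) })
    // Comment out the next two lines: main() returns, but the JVM keeps running.
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
{code}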
[jira] [Updated] (SPARK-1084) Fix most build warnings
[ https://issues.apache.org/jira/browse/SPARK-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1084:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Fix most build warnings
-----------------------

Key: SPARK-1084
URL: https://issues.apache.org/jira/browse/SPARK-1084
Project: Spark
Issue Type: Improvement
Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Labels: mvn, sbt, warning
Fix For: 1.0.0

I hope another boring tidy-up JIRA might be welcome. I'd like to fix most of the warnings that appear during the build, so that developers don't become accustomed to them. The accompanying pull request contains a number of commits to quash most warnings observed through the mvn and sbt builds, although not all of them.

FIXED:

[WARNING] Parameter tasks is deprecated, use target instead
Just a matter of updating tasks -> target in inline Ant scripts.

WARNING: -p has been deprecated and will be reused for a different (but still very cool) purpose in ScalaTest 2.0. Please change all uses of -p to -R.
Goes away with updating the scalatest plugin -> 1.0-RC2.

[WARNING] Note: /Users/srowen/Documents/incubator-spark/core/src/test/scala/org/apache/spark/JavaAPISuite.java uses unchecked or unsafe operations.
[WARNING] Note: Recompile with -Xlint:unchecked for details.
Mostly @SuppressWarnings("unchecked"), but a few more things were needed to reveal the warning source: <fork>true</fork> (also needed for <maxmem>) and version 3.1 of the plugin. In a few cases some declaration changes were appropriate to avoid warnings.

/Users/srowen/Documents/incubator-spark/core/src/main/scala/org/apache/spark/util/IndestructibleActorSystem.scala:25: warning: Could not find any member to link for "akka.actor.ActorSystem".
Getting several scaladoc errors like this and I'm not clear why it can't find the type -- outside its module? Remove the links as they're evidently not linking anyway?

/Users/srowen/Documents/incubator-spark/repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala:86: warning: Variable eval undefined in comment for class SparkIMain in class SparkIMain
$ has to be escaped as \$ in scaladoc, apparently.

[WARNING] 'dependencyManagement.dependencies.dependency.exclusions.exclusion.artifactId' for org.apache.hadoop:hadoop-yarn-client:jar with value '*' does not match a valid id pattern. @ org.apache.spark:spark-parent:1.0.0-incubating-SNAPSHOT, /Users/srowen/Documents/incubator-spark/pom.xml, line 494, column 25
This one might need review. This is valid Maven syntax, but Maven still warns on it. I wanted to see if we can do without it. These are trying to exclude:
- org.codehaus.jackson
- org.sonatype.sisu.inject
- org.xerial.snappy
org.sonatype.sisu.inject doesn't actually seem to be a dependency anyway. org.xerial.snappy is used by dependencies but the version seems to match anyway (1.0.5). org.codehaus.jackson was intended to exclude 1.8.8, since Spark streaming wants 1.9.11 directly. But the exclusion is in the wrong place if so, since Spark depends straight on Avro, which is what brings in 1.8.8, still. (hadoop-client 1.0.4 includes Jackson 1.0.1, so that needs an exclusion, but the other Hadoop modules don't.) HBase depends on 1.8.8, but I figured it was intentional to leave that, as it would not collide with Spark streaming. (?) (I understand this varies by Hadoop version but confirmed this is all the same for 1.0.4, 0.23.7, 2.2.0.)

NOT FIXED:

[warn] /Users/srowen/Documents/incubator-spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:305: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
Not confident enough to fix this.

[WARNING] there were 6 feature warning(s); re-run with -feature for details
Don't know enough Scala to address these yet.

[WARNING] We have a duplicate org/yaml/snakeyaml/scanner/ScannerImpl$Chomping.class in /Users/srowen/.m2/repository/org/yaml/snakeyaml/1.6/snakeyaml-1.6.jar
Probably addressable by being more careful about how binaries are packed, though this appears to be ignorable; two identical copies of the class are colliding.

[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
and
[WARNING] JAR will be empty - no content was marked for inclusion!
Apparently harmless warnings, but I don't know how to disable them.
[jira] [Updated] (SPARK-1663) Spark Streaming docs code has several small errors
[ https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1663:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Spark Streaming docs code has several small errors
--------------------------------------------------

Key: SPARK-1663
URL: https://issues.apache.org/jira/browse/SPARK-1663
Project: Spark
Issue Type: Bug
Components: Documentation
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Labels: streaming
Fix For: 1.0.0

The changes are easiest to elaborate in the PR, which I will open shortly. Those changes raised a few little questions about the API too.
[jira] [Updated] (SPARK-1335) Also increase perm gen / code cache for scalatest when invoked via Maven build
[ https://issues.apache.org/jira/browse/SPARK-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1335:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Also increase perm gen / code cache for scalatest when invoked via Maven build
--------------------------------------------------------------------------------

Key: SPARK-1335
URL: https://issues.apache.org/jira/browse/SPARK-1335
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Fix For: 1.0.0

I am observing build failures when the Maven build reaches tests in the new SQL components. (I'm on Java 7 / OSX 10.9.) The failure is the usual complaint from Scala: that it's out of permgen space, or that the JIT is out of code cache space.

I see that various build scripts increase both of these for SBT. This change simply adds these settings to scalatest's arguments. Works for me and seems a bit more consistent. (In the PR I'm going to tack on some other little changes too -- see PR.)
[jira] [Updated] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel
[ https://issues.apache.org/jira/browse/SPARK-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-2768:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Add product, user recommend method to MatrixFactorizationModel
--------------------------------------------------------------

Key: SPARK-2768
URL: https://issues.apache.org/jira/browse/SPARK-2768
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Fix For: 1.1.0

Right now, MatrixFactorizationModel can only predict a score for one or more (user, product) tuples. As a comment in the file notes, it would be more useful to expose a recommend method that computes the top-N scoring products for a user (or vice versa: users for a product).
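A minimal sketch of the kind of method being proposed, scoring every product for one user and keeping the top N. The names and the local-Map signature are purely illustrative (the real model holds RDDs of factor vectors), not the API that eventually shipped.

{code}
object RecommendSketch {
  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def recommendProducts(
      user: Int,
      num: Int,
      userFeatures: Map[Int, Array[Double]],
      productFeatures: Map[Int, Array[Double]]): Array[(Int, Double)] = {
    val u = userFeatures(user)
    productFeatures.toArray
      .map { case (product, f) => (product, dot(u, f)) } // predicted score
      .sortBy(-_._2)                                     // highest first
      .take(num)
  }
}
{code}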
[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-2748:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Loss of precision for small arguments to Math.exp, Math.log
------------------------------------------------------------

Key: SPARK-2748
URL: https://issues.apache.org/jira/browse/SPARK-2748
Project: Spark
Issue Type: Bug
Components: GraphX, MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Fix For: 1.1.0

In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method.

While the errors occur only for very small arguments, such arguments are entirely possible given the use of these expressions in machine learning algorithms.

Also, while we're here: naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues.
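The effect is easy to see in a Scala REPL:

{code}
val p = 1e-20
math.log(1.0 + p)  // 0.0 -- 1.0 + p rounds to 1.0, so all information about p is lost
math.log1p(p)      // 1.0E-20, correct to machine precision
math.exp(p) - 1.0  // 0.0, same problem
math.expm1(p)      // 1.0E-20
{code}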
[jira] [Updated] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)
[ https://issues.apache.org/jira/browse/SPARK-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1973:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Add randomSplit to JavaRDD (with tests, and tidy Java tests)
------------------------------------------------------------

Key: SPARK-1973
URL: https://issues.apache.org/jira/browse/SPARK-1973
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Fix For: 1.1.0

I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method, and that sometimes users should just use JavaRDD.wrapRDD()?)

Along the way, I added tests for it, and also touched up the Java API test style and behavior. That is maybe the more useful part of this small change.
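The wrapper amounts to delegating to the underlying Scala RDD and re-wrapping the pieces -- roughly the following, sketched as it would sit inside JavaRDD where wrapRDD is in scope (a sketch, not the exact code that was merged):

{code}
// Inside JavaRDD[T]:
def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]] =
  rdd.randomSplit(weights, seed).map(wrapRDD)
{code}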
[jira] [Updated] (SPARK-2745) Add Java friendly methods to Duration class
[ https://issues.apache.org/jira/browse/SPARK-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-2745:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Add Java friendly methods to Duration class
-------------------------------------------

Key: SPARK-2745
URL: https://issues.apache.org/jira/browse/SPARK-2745
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Tathagata Das
Assignee: Sean Owen
Priority: Minor
Fix For: 1.2.0
[jira] [Updated] (SPARK-1209) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
[ https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1209:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
--------------------------------------------------------------------------

Key: SPARK-1209
URL: https://issues.apache.org/jira/browse/SPARK-1209
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sean Owen
Fix For: 1.2.0

It's private, so the change won't break compatibility.
[jira] [Updated] (SPARK-1316) Remove use of Commons IO
[ https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-1316:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Remove use of Commons IO
------------------------

Key: SPARK-1316
URL: https://issues.apache.org/jira/browse/SPARK-1316
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
Fix For: 1.1.0

(This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 )

Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark Utils.scala class. Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too.
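For a sense of what "equivalent calls" means here, a couple of representative one-for-one swaps, runnable in the REPL. These are my examples, not a list taken from the PR.

{code}
import java.io.File
import java.nio.charset.StandardCharsets
import com.google.common.io.Files

val f = new File("/tmp/example.txt")

// Commons IO: FileUtils.writeStringToFile(f, "hi")
Files.write("hi", f, StandardCharsets.UTF_8)

// Commons IO: FileUtils.readFileToString(f)
val contents = Files.toString(f, StandardCharsets.UTF_8)

// Commons IO: FileUtils.deleteDirectory(dir) -> Spark's own
// Utils.deleteRecursively(dir)
{code}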
[jira] [Updated] (SPARK-4170) Closure problems when running Scala app that extends App
[ https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-4170:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

Closure problems when running Scala app that extends App
---------------------------------------------------------

Key: SPARK-4170
URL: https://issues.apache.org/jira/browse/SPARK-4170
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor

Michael Albert noted this problem on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html):

{code}
object DemoBug extends App {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    val rslt1 = rdd.filter(x => { x != "A" }).count
    val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count

    println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
}
{code}

This produces the output:
{code}
DemoBug: rslt1 = 3 rslt2 = 0
{code}

If instead there is a proper main(), it works as expected.

I also noticed this week that in a program which extends App, some values were inexplicably null in a closure. When changing to use main(), it was fine. I assume there is a problem with variables not being added to the closure when main() doesn't appear in the standard way.
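For comparison, the workaround looks like this. (The likely mechanism -- my reading, not something stated in the thread: vals in an App body are initialized via DelayedInit, so when the object is deserialized on an executor those fields can still hold their default null/zero values, whereas locals of a plain main() are captured by value.)

{code}
import org.apache.spark.{SparkConf, SparkContext}

object DemoFixed {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(List("A", "B", "C", "D"))
    val str1 = "A"

    val rslt2 = rdd.filter(x => str1 != null && x != "A").count
    println("DemoFixed: rslt2 = " + rslt2) // prints 3, as expected
  }
}
{code}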
[jira] [Updated] (SPARK-2602) sbt/sbt test steals window focus on OS X
[ https://issues.apache.org/jira/browse/SPARK-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tony Stevenson updated SPARK-2602:
----------------------------------
Assignee: Sean Owen (was: Sean Owen)

sbt/sbt test steals window focus on OS X
----------------------------------------

Key: SPARK-2602
URL: https://issues.apache.org/jira/browse/SPARK-2602
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Nicholas Chammas
Assignee: Sean Owen
Priority: Minor
Fix For: 1.1.0

On OS X, I run {{sbt/sbt test}} from Terminal and then go off and do something else with my computer. It appears that there are several things in the test suite that launch Java programs that, for some reason, steal window focus. It can get very annoying, especially if you happen to be typing something in a different window, to be suddenly teleported to a random Java application and have your finely crafted keystrokes be sent where they weren't intended. It would be nice if {{sbt/sbt test}} didn't do that.
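One plausible remedy -- an assumption on my part, not something stated in the issue: forked test JVMs that initialize AWT grab focus on OS X unless they run headless, so running the forked tests headless avoids it, e.g. in the sbt build:

{code}
// sbt 0.13 syntax; takes effect for forked test JVMs.
fork in Test := true
javaOptions in Test += "-Djava.awt.headless=true"
{code}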
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278614#comment-14278614 ]

Apache Spark commented on SPARK-5012:
-------------------------------------

User 'FlytxtRnD' has created a pull request for this issue: https://github.com/apache/spark/pull/4059

Python API for Gaussian Mixture Model
-------------------------------------

Key: SPARK-5012
URL: https://issues.apache.org/jira/browse/SPARK-5012
Project: Spark
Issue Type: New Feature
Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Meethu Mathew

Add Python API for the Scala implementation of GMM.
[jira] [Commented] (SPARK-5264) support `drop table` DDL command
[ https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278639#comment-14278639 ]

Apache Spark commented on SPARK-5264:
-------------------------------------

User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/4060

support `drop table` DDL command
--------------------------------

Key: SPARK-5264
URL: https://issues.apache.org/jira/browse/SPARK-5264
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
Fix For: 1.3.0
Original Estimate: 72h
Remaining Estimate: 72h

support `drop table` DDL command
[jira] [Created] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode
Lianhui Wang created SPARK-5266:
-----------------------------------

Summary: numExecutorsFailed should exclude number of killExecutor in yarn mode
Key: SPARK-5266
URL: https://issues.apache.org/jira/browse/SPARK-5266
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Lianhui Wang

When the driver requests killExecutor, the AM will kill the container and numExecutorsFailed will increment. When numExecutorsFailed >= maxNumExecutorFailures in the AM, the AM will exit with the EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should exclude executors killed at the driver's request.
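A sketch of the accounting change this implies -- illustrative names, not the actual patch: remember which containers the driver asked to kill, and skip the failure counter when one of those completes.

{code}
import scala.collection.mutable

var numExecutorsFailed = 0
val pendingKills = mutable.Set.empty[String]  // container IDs we asked YARN to kill

def onKillRequest(containerId: String): Unit =
  pendingKills += containerId

def onContainerCompleted(containerId: String, exitStatus: Int): Unit = {
  if (pendingKills.remove(containerId)) {
    // Killed at the driver's request -- intentional, so not a failure.
  } else if (exitStatus != 0) {
    numExecutorsFailed += 1
  }
}
{code}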
[jira] [Closed] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lianhui Wang closed SPARK-5266.
-------------------------------
Resolution: Fixed

numExecutorsFailed should exclude number of killExecutor in yarn mode
---------------------------------------------------------------------

Key: SPARK-5266
URL: https://issues.apache.org/jira/browse/SPARK-5266
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Lianhui Wang

When the driver requests killExecutor, the AM will kill the container and numExecutorsFailed will increment. When numExecutorsFailed >= maxNumExecutorFailures in the AM, the AM will exit with the EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should exclude executors killed at the driver's request.
[jira] [Commented] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278684#comment-14278684 ]

Apache Spark commented on SPARK-5266:
-------------------------------------

User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/4061

numExecutorsFailed should exclude number of killExecutor in yarn mode
---------------------------------------------------------------------

Key: SPARK-5266
URL: https://issues.apache.org/jira/browse/SPARK-5266
Project: Spark
Issue Type: Bug
Components: YARN
Reporter: Lianhui Wang

When the driver requests killExecutor, the AM will kill the container and numExecutorsFailed will increment. When numExecutorsFailed >= maxNumExecutorFailures in the AM, the AM will exit with the EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should exclude executors killed at the driver's request.
[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot
[ https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278710#comment-14278710 ]

Apache Spark commented on SPARK-4943:
-------------------------------------

User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/4062

Parsing error for query with table name having dot
---------------------------------------------------

Key: SPARK-4943
URL: https://issues.apache.org/jira/browse/SPARK-4943
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Alex Liu
Fix For: 1.3.0, 1.2.1

When integrating Spark 1.2.0 with Cassandra SQL, the following query is broken. It was working in Spark 1.1.0. Basically we use a table name containing a dot to include the database name:

{code}
[info]   java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but `.' found
[info]
[info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT test2.a FROM sql_test.test2 AS test2
[info]                             ^
[info]   at scala.sys.package$.error(package.scala:27)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
[info]   at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
[info]   at org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
[info]   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
[info]   at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
[info]   at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
[info]   at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
[info]   at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
[info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
[info]   at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
[info]   at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
[info]   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info]   at org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
[info]   at scala.Option.getOrElse(Option.scala:120)
[info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
[info]   at org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
[info]   at org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
[info]   at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
[info]   at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info]   at com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
[info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
[info]   at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
[info]   at ...
{code}
[jira] [Updated] (SPARK-5268) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
[ https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nan Zhu updated SPARK-5268:
---------------------------
Priority: Blocker (was: Major)

CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
--------------------------------------------------------------------

Key: SPARK-5268
URL: https://issues.apache.org/jira/browse/SPARK-5268
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.0
Reporter: Nan Zhu
Priority: Blocker

In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the executor backend actor and exit the program upon receiving such an event. Consider the following case: the user may develop an Akka-based program which starts an actor with Spark's actor system and communicates with an external actor system (e.g. an Akka-based receiver in Spark Streaming which communicates with an external system). If the external actor system fails, or deliberately disassociates from the actor within Spark's system, we may receive a DisassociatedEvent and the executor is restarted. This is not the expected behavior.
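A sketch of one possible guard -- my assumption about the shape of a fix, not the actual patch: inside the backend actor's receive, only treat a DisassociatedEvent as fatal when the remote address is the driver's, and ignore disassociations from unrelated actor systems.

{code}
import akka.remote.DisassociatedEvent

// Inside the backend actor's receive; driverAddress stands for the driver's
// Akka address, however the backend tracks it.
case x: DisassociatedEvent =>
  if (x.remoteAddress == driverAddress) {
    logError(s"Driver $driverAddress disassociated! Shutting down.")
    System.exit(1)
  } else {
    logWarning(s"Received irrelevant DisassociatedEvent $x -- ignoring")
  }
{code}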
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279170#comment-14279170 ]

Muhammad-Ali A'rabi commented on SPARK-5226:
--------------------------------------------

This is the DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts
      if P' is not visited
         mark P' as visited
         NeighborPts' = regionQuery(P', eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if P' is not yet member of any cluster
         add P' to cluster C

regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There are two ways to implement it. The first is faster (O(n log n)) and requires more memory (O(n^2)). The other is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short on memory. There are two phases of running:

* Preprocessing. In this phase a distance matrix for all points is created and the distances between every two points are calculated. Very parallel.
* Main process. In this phase the algorithm runs as described in the pseudo-code, and the two foreach loops are parallelized. Region queries are done very fast (O(1)) because of the preprocessing.

Add DBSCAN Clustering Algorithm to MLlib
----------------------------------------

Key: SPARK-5226
URL: https://issues.apache.org/jira/browse/SPARK-5226
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
Labels: DBSCAN

MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate is DBSCAN, I think.
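The preprocessing phase described in the comment maps naturally onto an RDD. A minimal sketch, assuming points carry a unique Long id (my formulation, not code from the issue):

{code}
import org.apache.spark.rdd.RDD

// All pairwise Euclidean distances, computed in parallel. The O(n^2) output
// is exactly the memory-for-speed trade-off described in the comment.
def pairwiseDistances(points: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
  points.cartesian(points)
    .filter { case ((i, _), (j, _)) => i < j }  // each unordered pair once
    .map { case ((i, p), (j, q)) =>
      val d = math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)
      ((i, j), d)
    }
{code}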
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279648#comment-14279648 ]

Apache Spark commented on SPARK-5193:
-------------------------------------

User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4065

Make Spark SQL API usable in Java and remove the Java-specific API
------------------------------------------------------------------

Key: SPARK-5193
URL: https://issues.apache.org/jira/browse/SPARK-5193
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

The Java version of the SchemaRDD API causes a high maintenance burden for Spark SQL itself and downstream libraries (e.g. the MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java-specific version.

Things to remove include (the Java version of):
- data type
- Row
- SQLContext
- HiveContext

Things to consider:
- Scala and Java have different collection libraries.
- Scala and Java (8) have different closure interfaces.
- Scala and Java can have duplicate definitions of common classes, such as BigDecimal.