[jira] [Updated] (SPARK-1996) Remove use of special Maven repo for Akka

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1996:
--
Assignee: Sean Owen  (was: Sean Owen)

 Remove use of special Maven repo for Akka
 -

 Key: SPARK-1996
 URL: https://issues.apache.org/jira/browse/SPARK-1996
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Matei Zaharia
Assignee: Sean Owen
 Fix For: 1.1.0


 According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html 
 Akka is now published to Maven Central, so our documentation and POM files 
 don't need to use the old Akka repo. It will be one less step for users to 
 worry about.






[jira] [Updated] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1827:
--
Assignee: Sean Owen  (was: Sean Owen)

 LICENSE and NOTICE files need a refresh to contain transitive dependency info
 -

 Key: SPARK-1827
 URL: https://issues.apache.org/jira/browse/SPARK-1827
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0


 (Pardon marking it a blocker, but I think it needs doing before 1.0 per chat 
 with [~pwendell])
 The LICENSE and NOTICE files need to cover all transitive dependencies, since 
 these are all distributed in the assembly jar. (c.f. 
 http://www.apache.org/dev/licensing-howto.html )
 I don't believe the current files cover everything. It's possible to 
 mostly-automatically generate these. I will generate this and propose a patch 
 to both today.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278453#comment-14278453
 ] 

Guoqiang Li commented on SPARK-1405:


We can use the demo scripts in word2vec to get the same corpus. 
{code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, which use optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (already solved), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.
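 To make the sampling step concrete, here is a toy, single-machine sketch of one 
 collapsed Gibbs sampling pass (the names, count arrays, and hyperparameters 
 alpha/beta are illustrative assumptions, not the PR's API; the actual 
 implementation distributes documents across the cluster and merges counts):
 {code}
 import scala.util.Random
 
 // Toy collapsed Gibbs sampler pass for LDA, to illustrate the update rule only.
 // docs(d)(i) is the word id of token i in document d; z holds topic assignments.
 def gibbsPass(docs: Array[Array[Int]], z: Array[Array[Int]],
               docTopic: Array[Array[Int]], topicWord: Array[Array[Int]],
               topicTotal: Array[Int], alpha: Double, beta: Double,
               numTopics: Int, vocabSize: Int, rng: Random): Unit = {
   for (d <- docs.indices; i <- docs(d).indices) {
     val w = docs(d)(i)
     val old = z(d)(i)
     // Remove the current assignment from the counts.
     docTopic(d)(old) -= 1; topicWord(old)(w) -= 1; topicTotal(old) -= 1
     // Full conditional for each topic, then sample a new topic proportionally.
     val p = Array.tabulate(numTopics) { k =>
       (docTopic(d)(k) + alpha) * (topicWord(k)(w) + beta) / (topicTotal(k) + beta * vocabSize)
     }
     var u = rng.nextDouble() * p.sum
     var k = 0
     while (k < numTopics - 1 && u > p(k)) { u -= p(k); k += 1 }
     // Record the new assignment and add it back to the counts.
     z(d)(i) = k; docTopic(d)(k) += 1; topicWord(k)(w) += 1; topicTotal(k) += 1
   }
 }
 {code}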






[jira] [Updated] (SPARK-1798) Tests should clean up temp files

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1798:
--
Assignee: Sean Owen  (was: Sean Owen)

 Tests should clean up temp files
 

 Key: SPARK-1798
 URL: https://issues.apache.org/jira/browse/SPARK-1798
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 Three issues related to temp files that tests generate -- these should be 
 touched up for hygiene but are not urgent.
 Modules have a log4j.properties which directs the unit-test.log output file 
 to a path like [module]/target/unit-test.log. But this ends up creating 
 [module]/[module]/target/unit-test.log instead of the former.
 The work/ directory is not deleted by mvn clean, in the parent and in 
 modules. Neither is the checkpoint/ directory created under the various 
 external modules.
 Many tests create a temp directory, which is not usually deleted. This can be 
 largely resolved by calling deleteOnExit() at creation and trying to call 
 Utils.deleteRecursively consistently to clean up, sometimes in an @After 
 method.
 (If anyone seconds the motion, I can create a more significant change that 
 introduces a new test trait along the lines of LocalSparkContext, which 
 provides management of temp directories for subclasses to take advantage of.)
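 If it helps the discussion, a rough sketch of such a trait (the name and the 
 ScalaTest BeforeAndAfterEach mechanics are my assumptions, not a committed design):
 {code}
 import java.io.File
 import java.nio.file.Files
 import org.scalatest.{BeforeAndAfterEach, Suite}
 
 // Sketch only: create a fresh temp directory per test and delete it afterwards.
 trait TempDirectory extends BeforeAndAfterEach { self: Suite =>
   private var _tempDir: File = _
   protected def tempDir: File = _tempDir
 
   override def beforeEach(): Unit = {
     super.beforeEach()
     _tempDir = Files.createTempDirectory("spark-test").toFile
     _tempDir.deleteOnExit()
   }
 
   override def afterEach(): Unit = {
     try deleteRecursively(_tempDir) finally super.afterEach()
   }
 
   private def deleteRecursively(f: File): Unit = {
     if (f.isDirectory) Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
     f.delete()
   }
 }
 {code}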






[jira] [Updated] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-3356:
--
Assignee: Sean Owen  (was: Sean Owen)

 Document when RDD elements' ordering within partitions is nondeterministic
 --

 Key: SPARK-3356
 URL: https://issues.apache.org/jira/browse/SPARK-3356
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Matei Zaharia
Assignee: Sean Owen
 Fix For: 1.2.0


 As reported in SPARK-3098 for example, for users of zipWithIndex, 
 zipWithUniqueId, etc. (and maybe even things like mapPartitions), it's 
 confusing that the order of elements in each partition after a shuffle 
 operation is nondeterministic (unless the operation was sortByKey). We should 
 explain this in the docs for the zip and partition-wise operations.
 Another subtle issue is that the order of values for each key in groupBy / 
 join / etc can be nondeterministic -- we need to explain that too.
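 For the docs, a short illustration along these lines might help (the snippet is 
 mine, assuming an existing SparkContext sc):
 {code}
 // After a shuffle (here repartition), the order within partitions is not
 // guaranteed, so zipWithIndex may assign different indices from run to run.
 val rdd = sc.parallelize(1 to 1000, 10)
 val indexed = rdd.repartition(4).zipWithIndex()
 // indexed.collect() can differ across runs unless the data is sorted first,
 // e.g. with sortBy or sortByKey, before zipping.
 {code}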






[jira] [Updated] (SPARK-2955) Test code fails to compile with mvn compile without install

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2955:
--
Assignee: Sean Owen  (was: Sean Owen)

 Test code fails to compile with mvn compile without install 
 

 Key: SPARK-2955
 URL: https://issues.apache.org/jira/browse/SPARK-2955
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
  Labels: build, compile, scalatest, test, test-compile
 Fix For: 1.2.0


 (This is the corrected follow-up to 
 https://issues.apache.org/jira/browse/SPARK-2903 )
 Right now, mvn compile test-compile fails to compile Spark. (Don't worry; 
 mvn package works, so this is not major.) The issue stems from test code in 
 some modules depending on test code in other modules. That is perfectly fine 
 and supported by Maven.
 It takes extra work to get this to work with scalatest, and this has been 
 attempted: 
 https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
 This formulation is not quite enough, since the SQL Core module's tests fail 
 to compile for lack of finding test classes in SQL Catalyst, and likewise for 
 most Streaming integration modules depending on core Streaming test code. 
 Example:
 {code}
 [error] 
 /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23:
  not found: type PlanTest
 [error] class QueryTest extends PlanTest {
 [error] ^
 [error] 
 /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28:
  package org.apache.spark.sql.test is not a value
 [error]   test("SPARK-1669: cacheTable should be idempotent") {
 [error]   ^
 ...
 {code}
 The issue I believe is that generation of a test-jar is bound here to the 
 compile phase, but the test classes are not being compiled in this phase. It 
 should bind to the test-compile phase.
 It works when executing mvn package or mvn install since test-jar 
 artifacts are actually generated and made available through normal Maven mechanisms as 
 each module is built. They are then found normally, regardless of scalatest 
 configuration.
 It would be nice for a simple mvn compile test-compile to work since the 
 test code is perfectly compilable given the Maven declarations.
 On the plus side, this change is low-risk as it only affects tests.
 [~yhuai] made the original scalatest change and has glanced at this and 
 thinks it makes sense.






[jira] [Updated] (SPARK-2034) KafkaInputDStream doesn't close resources and may prevent JVM shutdown

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2034:
--
Assignee: Sean Owen  (was: Sean Owen)

 KafkaInputDStream doesn't close resources and may prevent JVM shutdown
 --

 Key: SPARK-2034
 URL: https://issues.apache.org/jira/browse/SPARK-2034
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Sean Owen
Assignee: Sean Owen
 Fix For: 1.0.1, 1.1.0


 Tobias noted today on the mailing list:
 {quote}
 I am trying to use Spark Streaming with Kafka, which works like a
 charm -- except for shutdown. When I run my program with sbt
 run-main, sbt will never exit, because there are two non-daemon
 threads left that don't die.
 I created a minimal example at
 https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-kafkadoesntshutdown-scala.
 It starts a StreamingContext and does nothing more than connecting to
 a Kafka server and printing what it receives. Using the `future { ...
 }` construct, I shut down the StreamingContext after some seconds and
 then print the difference between the threads at start time and at end
 time. The output can be found at
 https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output1.
 There are a number of threads remaining that will prevent sbt from
 exiting.
 When I replace `KafkaUtils.createStream(...)` with a call that does
 exactly the same, except that it calls `consumerConnector.shutdown()`
 in `KafkaReceiver.onStop()` (which it should, IMO), the output is as
 shown at 
 https://gist.github.com/tgpfeiffer/b1e765064e983449c6b6#file-output2.
 Does anyone have *any* idea what is going on here and why the program
 doesn't shut down properly? The behavior is the same with both kafka
 0.8.0 and 0.8.1.1, by the way.
 {quote}
 Something similar was noted last year:
 http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3c1380220041.2428.yahoomail...@web160804.mail.bf1.yahoo.com%3E
  
 KafkaInputDStream doesn't close ConsumerConnector in onStop(), and does not 
 close the Executor it creates. The latter leaves non-daemon threads and can 
 prevent the JVM from shutting down even if streaming is closed properly.
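 A sketch of the kind of cleanup onStop() should do (the field names here are 
 assumptions based on the description, not the exact KafkaReceiver code):
 {code}
 // Release Kafka and thread-pool resources when the receiver stops, so no
 // non-daemon threads are left behind. consumerConnector and executorPool are
 // assumed to be vars on the receiver.
 def onStop(): Unit = {
   if (consumerConnector != null) {
     consumerConnector.shutdown()   // close the Kafka ConsumerConnector
     consumerConnector = null
   }
   if (executorPool != null) {
     executorPool.shutdown()        // let the worker threads terminate
     executorPool = null
   }
 }
 {code}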






[jira] [Updated] (SPARK-1084) Fix most build warnings

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1084:
--
Assignee: Sean Owen  (was: Sean Owen)

 Fix most build warnings
 ---

 Key: SPARK-1084
 URL: https://issues.apache.org/jira/browse/SPARK-1084
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
  Labels: mvn, sbt, warning
 Fix For: 1.0.0


 I hope another boring tidy-up JIRA might be welcome. I'd like to fix most of 
 the warnings that appear during build, so that developers don't become 
 accustomed to them. The accompanying pull request contains a number of 
 commits to quash most warnings observed through the mvn and sbt builds, 
 although not all of them.
 FIXED!
 [WARNING] Parameter tasks is deprecated, use target instead
 Just a matter of updating tasks -> target in inline Ant scripts.
 WARNING: -p has been deprecated and will be reused for a different (but still 
 very cool) purpose in ScalaTest 2.0. Please change all uses of -p to -R.
 Goes away with updating the scalatest plugin to 1.0-RC2.
 [WARNING] Note: 
 /Users/srowen/Documents/incubator-spark/core/src/test/scala/org/apache/spark/JavaAPISuite.java
  uses unchecked or unsafe operations.
 [WARNING] Note: Recompile with -Xlint:unchecked for details.
 Mostly @SuppressWarnings("unchecked") but needed a few more things to reveal 
 the warning source: <fork>true</fork> (also needed for <maxmem>) and version 
 3.1 of the plugin. In a few cases some declaration changes were appropriate 
 to avoid warnings.
 /Users/srowen/Documents/incubator-spark/core/src/main/scala/org/apache/spark/util/IndestructibleActorSystem.scala:25:
  warning: Could not find any member to link for akka.actor.ActorSystem.
 /**
 ^
 Getting several scaladoc errors like this and I'm not clear why it can't find 
 the type -- outside its module? Remove the links as they're evidently not 
 linking anyway?
 /Users/srowen/Documents/incubator-spark/repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala:86:
  warning: Variable eval undefined in comment for class SparkIMain in class 
 SparkIMain
 $ has to be escaped as \$ in scaladoc, apparently
 [WARNING] 
 'dependencyManagement.dependencies.dependency.exclusions.exclusion.artifactId'
  for org.apache.hadoop:hadoop-yarn-client:jar with value '*' does not match a 
 valid id pattern. @ org.apache.spark:spark-parent:1.0.0-incubating-SNAPSHOT, 
 /Users/srowen/Documents/incubator-spark/pom.xml, line 494, column 25
 This one might need review.
 This is valid Maven syntax, but Maven still warns on it. I wanted to see if 
 we can do without it. 
 These are trying to exclude:
 - org.codehaus.jackson
 - org.sonatype.sisu.inject
 - org.xerial.snappy
 org.sonatype.sisu.inject doesn't actually seem to be a dependency anyway. 
 org.xerial.snappy is used by dependencies but the version seems to match 
 anyway (1.0.5).
 org.codehaus.jackson was intended to exclude 1.8.8, since Spark streaming 
 wants 1.9.11 directly. But the exclusion is in the wrong place if so, since 
 Spark depends straight on Avro, which is what brings in 1.8.8, still. 
 (hadoop-client 1.0.4 includes Jackson 1.0.1, so that needs an exclusion, but 
 the other Hadoop modules don't.)
 HBase depends on 1.8.8, but I figured it was intentional to leave that, as it 
 would not collide with Spark streaming. (?)
 (I understand this varies by Hadoop version but confirmed this is all the 
 same for 1.0.4, 0.23.7, 2.2.0.)
 NOT FIXED.
 [warn] 
 /Users/srowen/Documents/incubator-spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:305:
  method connect in class IOManager is deprecated: use the new implementation 
 in package akka.io instead
 [warn]   override def preStart = IOManager(context.system).connect(new 
 InetSocketAddress(port))
 Not confident enough to fix this.
 [WARNING] there were 6 feature warning(s); re-run with -feature for details
 Don't know enough Scala to address these, yet.
 [WARNING] We have a duplicate 
 org/yaml/snakeyaml/scanner/ScannerImpl$Chomping.class in 
 /Users/srowen/.m2/repository/org/yaml/snakeyaml/1.6/snakeyaml-1.6.jar
 Probably addressable by being more careful about how binaries are packed, 
 though this appears to be ignorable; two identical copies of the class are 
 colliding.
 [WARNING] Zinc server is not available at port 3030 - reverting to normal 
 incremental compile
 and
 [WARNING] JAR will be empty - no content was marked for inclusion!
 Apparently harmless warnings, but I don't know how to disable them.




[jira] [Updated] (SPARK-1663) Spark Streaming docs code has several small errors

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1663:
--
Assignee: Sean Owen  (was: Sean Owen)

 Spark Streaming docs code has several small errors
 --

 Key: SPARK-1663
 URL: https://issues.apache.org/jira/browse/SPARK-1663
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
  Labels: streaming
 Fix For: 1.0.0


 The changes are easiest to elaborate in the PR, which I will open shortly.
 Those changes raised a few little questions about the API too.






[jira] [Updated] (SPARK-1335) Also increase perm gen / code cache for scalatest when invoked via Maven build

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1335:
--
Assignee: Sean Owen  (was: Sean Owen)

 Also increase perm gen / code cache for scalatest when invoked via Maven build
 --

 Key: SPARK-1335
 URL: https://issues.apache.org/jira/browse/SPARK-1335
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
 Fix For: 1.0.0


 I am observing build failures when the Maven build reaches tests in the new 
 SQL components. (I'm on Java 7 / OSX 10.9). The failure is the usual 
 complaint from Scala: it's out of perm gen space, or the JIT is out of code 
 cache space.
 I see that various build scripts increase these both for SBT. This change 
 simply adds these settings to scalatest's arguments. Works for me and seems a 
 bit more consistent.
 (In the PR I'm going to tack on some other little changes too -- see PR.)






[jira] [Updated] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2768:
--
Assignee: Sean Owen  (was: Sean Owen)

 Add product, user recommend method to MatrixFactorizationModel
 --

 Key: SPARK-2768
 URL: https://issues.apache.org/jira/browse/SPARK-2768
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 Right now, MatrixFactorizationModel can only predict a score for one or more 
 (user, product) tuples. As a comment in the file notes, it would be more 
 useful to expose a recommend method that computes the top N scoring products for 
 a user (or vice versa -- users for a product).
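 A minimal sketch of the idea (method and field names are illustrative, not the 
 final API):
 {code}
 object RecommendSketch {
   // Rank all products for one user by dot product with the user's latent
   // factors and take the top n. Brute force, for illustration only.
   def recommendProducts(userFeatures: Array[Double],
                         productFeatures: Map[Int, Array[Double]],
                         n: Int): Seq[(Int, Double)] =
     productFeatures.toSeq
       .map { case (id, f) => (id, userFeatures.zip(f).map { case (u, p) => u * p }.sum) }
       .sortBy(-_._2)
       .take(n)
 }
 {code}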






[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2748:
--
Assignee: Sean Owen  (was: Sean Owen)

 Loss of precision for small arguments to Math.exp, Math.log
 ---

 Key: SPARK-2748
 URL: https://issues.apache.org/jira/browse/SPARK-2748
 Project: Spark
  Issue Type: Bug
  Components: GraphX, MLlib
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 In a few places in MLlib, an expression of the form log(1.0 + p) is 
 evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However 
 the correct answer is very near p. This is why Math.log1p exists.
 Similarly for one instance of exp(m) - 1 in GraphX; there's a special 
 Math.expm1 method.
 While the errors occur only for very small arguments, such arguments are 
 entirely possible given these functions' use in machine learning algorithms.
 Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 
 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I 
 don't think there's a JIRA on that one, so maybe this can serve as an 
 umbrella for all of these related issues.
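 A tiny illustration of the precision loss (the snippet is mine, not taken from 
 the affected code):
 {code}
 // Why log1p/expm1 matter for tiny arguments.
 val p = 1e-18
 println(math.log(1.0 + p))  // 0.0, because 1.0 + p rounds to 1.0
 println(math.log1p(p))      // ~1.0e-18, the correct value
 println(math.exp(p) - 1.0)  // 0.0, same problem
 println(math.expm1(p))      // ~1.0e-18
 {code}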






[jira] [Updated] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1973:
--
Assignee: Sean Owen  (was: Sean Owen)

 Add randomSplit to JavaRDD (with tests, and tidy Java tests)
 

 Key: SPARK-1973
 URL: https://issues.apache.org/jira/browse/SPARK-1973
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 I'd like to use randomSplit through the Java API, and would like to add a 
 convenience wrapper for this method to JavaRDD. This is fairly trivial. (In 
 fact, is the intent that JavaRDD not wrap every RDD method, and that 
 sometimes users should just use JavaRDD.wrapRDD()?)
 Along the way, I added tests for it, and also touched up the Java API test 
 style and behavior. This is maybe the more useful part of this small change.
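 The wrapper itself could be as small as this sketch (assuming it lives inside 
 JavaRDD[T], where rdd is the underlying RDD[T] and wrapRDD re-wraps results; 
 not copied from the merged patch):
 {code}
 // Delegate to the Scala RDD and re-wrap each split as a JavaRDD.
 def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]] =
   rdd.randomSplit(weights, seed).map(wrapRDD)
 {code}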






[jira] [Updated] (SPARK-2745) Add Java friendly methods to Duration class

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2745:
--
Assignee: Sean Owen  (was: Sean Owen)

 Add Java friendly methods to Duration class
 ---

 Key: SPARK-2745
 URL: https://issues.apache.org/jira/browse/SPARK-2745
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.2.0









[jira] [Updated] (SPARK-1209) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1209:
--
Assignee: Sean Owen  (was: Sean Owen)

 SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
 --

 Key: SPARK-1209
 URL: https://issues.apache.org/jira/browse/SPARK-1209
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sandy Ryza
Assignee: Sean Owen
 Fix For: 1.2.0


 It's private, so the change won't break compatibility






[jira] [Updated] (SPARK-1316) Remove use of Commons IO

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1316:
--
Assignee: Sean Owen  (was: Sean Owen)

 Remove use of Commons IO
 

 Key: SPARK-1316
 URL: https://issues.apache.org/jira/browse/SPARK-1316
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (This follows from a side point on SPARK-1133, in discussion of the PR: 
 https://github.com/apache/spark/pull/164 )
 Commons IO is barely used in the project, and can easily be replaced with 
 equivalent calls to Guava or the existing Spark Utils.scala class.
 Removing a dependency feels good, and this one in particular can get a little 
 problematic since Hadoop uses it too.
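 For example (call sites and replacements are illustrative only), a typical 
 Commons IO usage maps directly onto Guava or Spark's own utilities:
 {code}
 import java.io.File
 import com.google.common.base.Charsets
 import com.google.common.io.Files
 
 val dir = new File("/tmp/example")
 dir.mkdirs()
 // Commons IO: FileUtils.write(file, text)  ->  Guava:
 Files.write("some text", new File(dir, "out.txt"), Charsets.UTF_8)
 // Commons IO: FileUtils.deleteDirectory(dir)  ->  Spark's helper:
 // org.apache.spark.util.Utils.deleteRecursively(dir)
 {code}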






[jira] [Updated] (SPARK-4170) Closure problems when running Scala app that extends App

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-4170:
--
Assignee: Sean Owen  (was: Sean Owen)

 Closure problems when running Scala app that extends App
 --

 Key: SPARK-4170
 URL: https://issues.apache.org/jira/browse/SPARK-4170
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor

 Michael Albert noted this problem on the mailing list 
 (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html):
 {code}
 object DemoBug extends App {
   val conf = new SparkConf()
   val sc = new SparkContext(conf)
   val rdd = sc.parallelize(List("A", "B", "C", "D"))
   val str1 = "A"
   val rslt1 = rdd.filter(x => { x != "A" }).count
   val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
 
   println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
 }
 {code}
 This produces the output:
 {code}
 DemoBug: rslt1 = 3 rslt2 = 0
 {code}
 If instead there is a proper main(), it works as expected.
 I also noticed this week that in a program which extends App, some values 
 were inexplicably null in a closure. When I changed it to use main(), it was fine.
 I assume there is a problem with variables not being added to the closure 
 when main() doesn't appear in the standard way.
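 For comparison, a minimal version with an explicit main() (my rewrite of the 
 snippet above) behaves as expected:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 
 // Same program, but with an explicit main() instead of extends App, so fields
 // are initialized before the closures are shipped to executors.
 object DemoFixed {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
     val sc = new SparkContext(conf)
     val rdd = sc.parallelize(List("A", "B", "C", "D"))
     val str1 = "A"
     val rslt1 = rdd.filter(x => x != "A").count
     val rslt2 = rdd.filter(x => str1 != null && x != "A").count
     println("DemoFixed: rslt1 = " + rslt1 + " rslt2 = " + rslt2)  // expect 3 and 3
     sc.stop()
   }
 }
 {code}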






[jira] [Updated] (SPARK-2602) sbt/sbt test steals window focus on OS X

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2602:
--
Assignee: Sean Owen  (was: Sean Owen)

 sbt/sbt test steals window focus on OS X
 

 Key: SPARK-2602
 URL: https://issues.apache.org/jira/browse/SPARK-2602
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 On OS X, I run {{sbt/sbt test}} from Terminal and then go off and do 
 something else with my computer. It appears that there are several things in 
 the test suite that launch Java programs that, for some reason, steal window 
 focus. 
 It can get very annoying, especially if you happen to be typing something in 
 a different window, to be suddenly teleported to a random Java application 
 and have your finely crafted keystrokes be sent where they weren't intended.
 It would be nice if {{sbt/sbt test}} didn't do that.






[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278614#comment-14278614
 ] 

Apache Spark commented on SPARK-5012:
-

User 'FlytxtRnD' has created a pull request for this issue:
https://github.com/apache/spark/pull/4059

 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Meethu Mathew

 Add Python API for the Scala implementation of GMM.






[jira] [Commented] (SPARK-5264) support `drop table` DDL command

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278639#comment-14278639
 ] 

Apache Spark commented on SPARK-5264:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/4060

 support `drop table` DDL command 
 -

 Key: SPARK-5264
 URL: https://issues.apache.org/jira/browse/SPARK-5264
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0

   Original Estimate: 72h
  Remaining Estimate: 72h

 support `drop table` DDL command 






[jira] [Created] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode

2015-01-15 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-5266:
---

 Summary: numExecutorsFailed should exclude number of killExecutor 
in yarn mode
 Key: SPARK-5266
 URL: https://issues.apache.org/jira/browse/SPARK-5266
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang


When the driver requests killExecutor, the AM will kill the container and numExecutorsFailed 
will increment. When numExecutorsFailed > maxNumExecutorFailures in the AM, the AM will 
exit with the EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should 
exclude executors killed at the driver's request.
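One way to express the proposed accounting (purely a sketch; the real 
YarnAllocator/ApplicationMaster fields and names differ):
{code}
import scala.collection.mutable

// Sketch: remember which executors the driver asked to kill, so their
// container exits are not counted toward the failure limit.
class ExecutorFailureTracker(maxFailures: Int) {
  private val pendingKills = mutable.Set[String]()
  private var failedCount = 0

  def onKillRequested(executorId: String): Unit = pendingKills += executorId

  def onContainerExited(executorId: String): Unit = {
    if (!pendingKills.remove(executorId)) {
      failedCount += 1   // only unrequested exits count as failures
    }
  }

  def shouldAbort: Boolean = failedCount > maxFailures
}
{code}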






[jira] [Closed] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode

2015-01-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang closed SPARK-5266.
---
Resolution: Fixed

 numExecutorsFailed should exclude number of killExecutor in yarn mode
 -

 Key: SPARK-5266
 URL: https://issues.apache.org/jira/browse/SPARK-5266
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang

 When the driver requests killExecutor, the AM will kill the container and 
 numExecutorsFailed will increment. When numExecutorsFailed > 
 maxNumExecutorFailures in the AM, the AM will exit with the EXIT_MAX_EXECUTOR_FAILURES 
 reason. So numExecutorsFailed should exclude executors killed at the driver's request.






[jira] [Commented] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278684#comment-14278684
 ] 

Apache Spark commented on SPARK-5266:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4061

 numExecutorsFailed should exclude number of killExecutor in yarn mode
 -

 Key: SPARK-5266
 URL: https://issues.apache.org/jira/browse/SPARK-5266
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang

 When the driver requests killExecutor, the AM will kill the container and 
 numExecutorsFailed will increment. When numExecutorsFailed > 
 maxNumExecutorFailures in the AM, the AM will exit with the EXIT_MAX_EXECUTOR_FAILURES 
 reason. So numExecutorsFailed should exclude executors killed at the driver's request.






[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278710#comment-14278710
 ] 

Apache Spark commented on SPARK-4943:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/4062

 Parsing error for query with table name having dot
 --

 Key: SPARK-4943
 URL: https://issues.apache.org/jira/browse/SPARK-4943
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Alex Liu
 Fix For: 1.3.0, 1.2.1


 When integrating Spark 1.2.0 with Cassandra SQL, the following query is 
 broken. It was working in Spark 1.1.0. Basically, we use a table 
 name containing a dot to include the database name.
 {code}
 [info]   java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but 
 `.' found
 [info] 
 [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT 
 test2.a FROM sql_test.test2 AS test2
 [info] ^
 [info]   at scala.sys.package$.error(package.scala:27)
 [info]   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
 [info]   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
 [info]   at 
 org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 [info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 [info]   at 
 scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 [info]   at 
 scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 [info]   at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
 [info]   at 
 org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
 [info]   at scala.Option.getOrElse(Option.scala:120)
 [info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
 [info]   at 
 org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
 [info]   at 
 org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
 [info]   at 
 com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
 [info]   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
 [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
 [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
 [info]   at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
 [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
 [info]   at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683)
 [info]   at 
 

[jira] [Updated] (SPARK-5268) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-5268:
---
Priority: Blocker  (was: Major)

 CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
 

 Key: SPARK-5268
 URL: https://issues.apache.org/jira/browse/SPARK-5268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Nan Zhu
Priority: Blocker

 In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the executor 
 backend actor and exit the program upon receiving such an event...
 Let's consider the following case:
 the user may develop an Akka-based program which starts an actor with 
 Spark's actor system and communicates with an external actor system (e.g. an 
 Akka-based receiver in Spark Streaming which communicates with an external 
 system). If the external actor system fails or disassociates from the actor 
 within Spark's system on purpose, we may receive a DisassociatedEvent and the 
 executor is restarted.
 This is not the expected behavior.
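 A sketch of the kind of guard that would avoid this (the actual fields in 
 CoarseGrainedExecutorBackend may differ):
 {code}
 // Only a disassociation from the driver's address should be fatal; events from
 // other remote systems (e.g. a user's receiver peer) should just be ignored.
 def handleDisassociated(event: akka.remote.DisassociatedEvent,
                         driverAddress: akka.actor.Address): Unit = {
   if (event.remoteAddress == driverAddress) {
     System.exit(1)   // the driver really went away
   } else {
     // irrelevant remote system disassociated: ignore
   }
 }
 {code}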






[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-15 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279170#comment-14279170
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts) < MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There are two ways to 
implement it. The first is faster (O(n log n)) but requires more memory 
(O(n^2)). The other is slower (O(n^2)) but requires less memory (O(n)). 
I prefer the first one, as we are not short on memory.
There are two phases of execution:
* Preprocessing. In this phase a distance matrix over all points is created and 
the distance between every pair of points is calculated. Highly parallel. (A rough 
sketch follows below.)
* Main process. In this phase the algorithm runs as described in the 
pseudo-code, and the two foreach loops are parallelized. Region queries are answered very 
fast because of the preprocessing.
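A rough single-machine sketch of the preprocessing and regionQuery idea 
(illustrative only, not a proposed MLlib API):
{code}
type Point = Array[Double]

def euclidean(a: Point, b: Point): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// Precompute all pairwise distances (the O(n^2)-memory variant described above).
def distanceMatrix(points: IndexedSeq[Point]): Array[Array[Double]] =
  Array.tabulate(points.length, points.length)((i, j) => euclidean(points(i), points(j)))

// regionQuery(P, eps): indices of all points within eps of point i, including i.
def regionQuery(dist: Array[Array[Double]], i: Int, eps: Double): Seq[Int] =
  dist(i).indices.filter(j => dist(i)(j) <= eps)
{code}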

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. First candidate is DBSCAN as I think.






[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279648#comment-14279648
 ] 

Apache Spark commented on SPARK-5193:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4065

 Make Spark SQL API usable in Java and remove the Java-specific API
 --

 Key: SPARK-5193
 URL: https://issues.apache.org/jira/browse/SPARK-5193
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 Java version of the SchemaRDD API causes high maintenance burden for Spark 
 SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support 
 both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it 
 usable for Java, and then we can remove the Java specific version. 
 Things to remove include (Java version of):
 - data type
 - Row
 - SQLContext
 - HiveContext
 Things to consider:
 - Scala and Java have different collection libraries.
 - Scala and Java (8) have different closure interfaces.
 - Scala and Java can have duplicate definitions of common classes, such as 
 BigDecimal.





