[jira] [Updated] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-07 Thread Monica Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Monica Liu updated SPARK-10981:
---
Description: 
I am using SparkR from RStudio, and I ran into an error with the join function 
that I recreated with a smaller example:

{code:title=joinTest.R|borderStyle=solid}
Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init("local[4]")
sqlContext <- sparkRSQL.init(sc) 

n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
df1= createDataFrame(sqlContext, df)
showDF(df1)

x = c(2, 3, 10)
t = c("dd", "ee", "ff")
c = c(FALSE, FALSE, TRUE)
dff = data.frame(x, t, c)
df2 = createDataFrame(sqlContext, dff)
showDF(df2)
res = join(df1, df2, df1$n == df2$x, "semijoin")
showDF(res)
{code}

Running this code, I encountered the error:
{panel}
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
{panel}

However, if I changed the joinType to "leftsemi", 
{code}
res = join(df1, df2, df1$n == df2$x, "leftsemi")
{code}

I would get the error:
{panel}
Error in .local(x, y, ...) : 
  joinType must be one of the following types: 'inner', 'outer', 'left_outer', 
'right_outer', 'semijoin'
{panel}

Since the join function in R appears to invoke a Java method, I went into 
DataFrame.R and, on lines 1374 and 1378, changed "semijoin" to "leftsemi" to 
match the parameter values the Java method expects. This also makes the 
joinType values accepted in R match those accepted in Scala. 

semijoin:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
}
{code}

leftsemi:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
}
{code}

This fixed the issue, but I'm not sure whether this solution breaks Hive 
compatibility or causes other issues. I can submit a pull request to make this 
change.
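
For reference, a minimal Scala sketch (not part of this ticket; it assumes a spark-shell with sqlContext in scope) of the equivalent left-semi join that the patched R code delegates to through callJMethod:

{code:title=leftsemi.scala (illustration only)|borderStyle=solid}
// Build two small DataFrames mirroring the R example above.
val df1 = sqlContext.createDataFrame(Seq((2, "aa"), (3, "bb"), (5, "cc"))).toDF("n", "s")
val df2 = sqlContext.createDataFrame(Seq((2, "dd"), (3, "ee"), (10, "ff"))).toDF("x", "t")

// "leftsemi" is one of the join types the JVM side accepts (see the error above).
val res = df1.join(df2, df1("n") === df2("x"), "leftsemi")
res.show()  // keeps only the rows of df1 whose n has a match in df2.x
{code}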

  was:
I am using SparkR from RStudio, and I ran into an error with the join function 
that I recreated with a smaller example:

{code:title=joinTest.R|borderStyle=solid}
Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init("local[4]")
sqlContext <- sparkRSQL.init(sc) 

n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
df1= createDataFrame(sqlContext, df)
showDF(df1)

x = c(2, 3, 10)
t = c("dd", "ee", "ff")
c = c(FALSE, FALSE, TRUE)
dff = data.frame(x, t, c)
df2 = createDataFrame(sqlContext, dff)
showDF(df2)
res = join(df1, df2, df1$n == df2$x, "semijoin")
showDF(res)
{code}

Running this code, I encountered the error:
{panel}
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
{panel}

However, if I changed the joinType to "leftsemi", 
{code}
res = join(df1, df2, df1$n == df2$x, "leftsemi")
{code}

I would get the error:
{panel}
Error in .local(x, y, ...) : 
  joinType must be one of the following types: 'inner', 'outer', 'left_outer', 
'right_outer', 'semijoin'
{panel}

Since the join function in R appears to invoke a Java method, I went into 
DataFrame.R and, on lines 1374 and 1378, changed "semijoin" to "leftsemi" to 
match the parameter values the Java method expects. This also makes the 
joinType values accepted in R match those accepted in Scala. 

semijoin:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
}
{code}

leftsemi:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) 
{
sdf <- 

[jira] [Updated] (SPARK-10909) Spark sql jdbc fails for Oracle NUMBER type columns

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10909:

Labels:   (was: jdbc newbie sql)

> Spark sql jdbc fails for Oracle NUMBER type columns
> ---
>
> Key: SPARK-10909
> URL: https://issues.apache.org/jira/browse/SPARK-10909
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Dev
>Reporter: Kostas papageorgopoulos
>Priority: Minor
>
> When using Spark SQL to connect to Oracle and run a query, I get the 
> following exception: "requirement failed: Overflowed precision". This is 
> triggered when the dbtable definition includes an Oracle NUMBER 
> column.
> {code}
>SQLContext sqlContext = new SQLContext(sc);
> Map<String, String> options = new HashMap<>();
> options.put("driver", "oracle.jdbc.OracleDriver");
> options.put("user", "USER");
> options.put("password", "PASS");
> options.put("url", "ORACLE CONNECTINO URL");
> options.put("dbtable", "(select VARCHAR_COLUMN 
> ,TIMESTAMP_COLUMN,NUMBER_COLUMN from lsc_subscription_profiles)");
> DataFrame jdbcDF = 
> sqlContext.read().format("jdbc").options(options).load();
> jdbcDF.toJavaRDD().saveAsTextFile("hdfs://hdfshost:8020" + 
> "/path/to/write.bz2", BZip2Codec.class);
> {code}
> using driver 
> {code}
> <dependency>
>     <groupId>com.oracle</groupId>
>     <artifactId>ojdbc6</artifactId>
>     <version>11.2.0.3.0</version>
> </dependency>
> {code}
> Using the Sun Java JDK 1.8.0_51 along with Spring 4.
> The classpath of the junit run is 
> {code}
> /home/kostas/dev2/tools/jdk1.8.0_51/bin/java 
> -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:42901,suspend=y,server=n 
> -ea -Duser.timezone=Africa/Cairo -Dfile.encoding=UTF-8 -classpath 
> 

[jira] [Updated] (SPARK-10956) Introduce common memory management interface for execution and storage

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10956:
--
Affects Version/s: (was: 1.0.0)

> Introduce common memory management interface for execution and storage
> --
>
> Key: SPARK-10956
> URL: https://issues.apache.org/jira/browse/SPARK-10956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> The first step towards implementing a solution for SPARK-10000 is to refactor 
> the existing code to go through a common MemoryManager interface. *This issue 
> is concerned only with the introduction of this interface, preserving the 
> existing behavior as much as possible.* In the near future, we will implement 
> an alternate MemoryManager that shares memory between storage and execution 
> more efficiently.
> For a high level design doc, see SPARK-10000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10941) Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

2015-10-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10941.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8973
[https://github.com/apache/spark/pull/8973]

> Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve 
> code clarity
> --
>
> Key: SPARK-10941
> URL: https://issues.apache.org/jira/browse/SPARK-10941
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> Spark SQL's new AlgebraicAggregate interface is confusingly named.
> AlgebraicAggregate inherits from AggregateFunction2, adds a new set of 
> methods, then effectively bans the use of the inherited methods. This is 
> really confusing. I think that it's an anti-pattern / bad code smell if you 
> end up inheriting and wanting to remove methods inherited from the superclass.
> I think that we should re-name this class and should refactor the class 
> hierarchy so that there's a clear distinction between which parts of the code 
> work with imperative aggregate functions vs. expression-based aggregates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors

2015-10-07 Thread Monica Liu (JIRA)
Monica Liu created SPARK-10981:
--

 Summary: R semijoin leads to Java errors, R leftsemi leads to 
Spark errors
 Key: SPARK-10981
 URL: https://issues.apache.org/jira/browse/SPARK-10981
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.5.0
 Environment: SparkR from RStudio on Macbook
Reporter: Monica Liu
Priority: Minor


I am using SparkR from RStudio, and I ran into an error with the join function 
that I recreated with a smaller example:

{code:title=joinTest.R|borderStyle=solid}
Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init("local[4]")
sqlContext <- sparkRSQL.init(sc) 

n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
df1= createDataFrame(sqlContext, df)
showDF(df1)

x = c(2, 3, 10)
t = c("dd", "ee", "ff")
c = c(FALSE, FALSE, TRUE)
dff = data.frame(x, t, c)
df2 = createDataFrame(sqlContext, dff)
showDF(df2)
res = join(df1, df2, df1$n == df2$x, "semijoin")
showDF(res)
{code}

Running this code, I encountered the error:
{panel}
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. 
Supported join types include: 'inner', 'outer', 'full', 'fullouter', 
'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
{panel}

However, if I changed the joinType to "leftsemi", 
{code}
res = join(df1, df2, df1$n == df2$x, "leftsemi")
{code}

I would get the error:
{panel}
Error in .local(x, y, ...) : 
  joinType must be one of the following types: 'inner', 'outer', 'left_outer', 
'right_outer', 'semijoin'
{panel}

Since the join function in R appears to invoke a Java method, I went into 
DataFrame.R and, on lines 1374 and 1378, changed "semijoin" to "leftsemi" to 
match the parameter values the Java method expects. This also makes the 
joinType values accepted in R match those accepted in Scala. 

semijoin:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
}
{code}

leftsemi:
{code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) {
  sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
} else {
  stop("joinType must be one of the following types: ",
       "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
}
{code}

This fixed the issue, but I'm not sure whether this solution breaks Hive 
compatibility or causes other issues, or whether the underlying problem is a 
compatibility mismatch elsewhere. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10000) Consolidate cache memory management and execution memory management

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10000:
--
Description: 
Memory management in Spark is currently broken down into two disjoint regions: 
one for execution and one for storage. The sizes of these regions are 
statically configured and fixed for the duration of the application.

There are several limitations to this approach. It requires user expertise to 
avoid unnecessary spilling, and there are no sensible defaults that will work 
for all workloads. As a Spark user, I want Spark to manage the memory more 
intelligently so I do not need to worry about how to statically partition the 
execution (shuffle) memory fraction and cache memory fraction. More 
importantly, applications that do not use caching use only a small fraction of 
the heap space, resulting in suboptimal performance.

Instead, we should unify these two regions and let one borrow from another if 
possible.

  was:
Memory management in Spark is currently broken down into two disjoint regions: 
one for execution and one for storage. The sizes of these regions are 
statically configured and fixed for the duration of the application.

There are several limitations to this approach. It requires user expertise to 
avoid unnecessary spilling, and there are no sensible defaults that will work 
for all workloads. As a Spark user, I want Spark to manage the memory more 
intelligently so I do not need to worry about how to statically partition the 
execution (shuffle) memory fraction and cache memory fraction. Most 
importantly, applications that do not use caching use only a small fraction of 
the heap space, resulting in suboptimal performance.




> Consolidate cache memory management and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>
> Memory management in Spark is currently broken down into two disjoint 
> regions: one for execution and one for storage. The sizes of these regions 
> are statically configured and fixed for the duration of the application.
> There are several limitations to this approach. It requires user expertise to 
> avoid unnecessary spilling, and there are no sensible defaults that will work 
> for all workloads. As a Spark user, I want Spark to manage the memory more 
> intelligently so I do not need to worry about how to statically partition the 
> execution (shuffle) memory fraction and cache memory fraction. More 
> importantly, applications that do not use caching use only a small fraction 
> of the heap space, resulting in suboptimal performance.
> Instead, we should unify these two regions and let one borrow from another if 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10000) Consolidate cache memory management and execution memory management

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10000:
--
Attachment: unified-memory-management-spark-10000.pdf

> Consolidate cache memory management and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: unified-memory-management-spark-10000.pdf
>
>
> Memory management in Spark is currently broken down into two disjoint 
> regions: one for execution and one for storage. The sizes of these regions 
> are statically configured and fixed for the duration of the application.
> There are several limitations to this approach. It requires user expertise to 
> avoid unnecessary spilling, and there are no sensible defaults that will work 
> for all workloads. As a Spark user, I want Spark to manage the memory more 
> intelligently so I do not need to worry about how to statically partition the 
> execution (shuffle) memory fraction and cache memory fraction. More 
> importantly, applications that do not use caching use only a small fraction 
> of the heap space, resulting in suboptimal performance.
> Instead, we should unify these two regions and let one borrow from another if 
> possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10679) javax.jdo.JDOFatalUserException in executor

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10679.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> javax.jdo.JDOFatalUserException in executor
> ---
>
> Key: SPARK-10679
> URL: https://issues.apache.org/jira/browse/SPARK-10679
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Navis
>Priority: Minor
> Fix For: 1.6.0
>
>
> HadoopRDD throws exception in executor, something like below.
> {noformat}
> 15/09/17 18:51:21 INFO metastore.HiveMetaStore: 0: Opening raw store with 
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/09/17 18:51:21 INFO metastore.ObjectStore: ObjectStore, initialize called
> 15/09/17 18:51:21 WARN metastore.HiveMetaStore: Retrying creating default 
> database after error: Class 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
> javax.jdo.JDOFatalUserException: Class 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:803)
>   at 
> org.apache.hadoop.hive.ql.plan.PlanUtils.configureInputJobPropertiesForStorageHandler(PlanUtils.java:782)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:298)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$12.apply(TableReader.scala:274)
>   at 
> 

[jira] [Updated] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Description: 
Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */

{code}

  was:
Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */



> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947711#comment-14947711
 ] 

Reynold Xin commented on SPARK-10914:
-

I don't think the size estimator would impact the result. 

If I understand this correctly, this fails with a small heap and compressed 
oops turned off? I can't reproduce it locally. I tried launching spark-shell 
using
{code}
bin/spark-shell --driver-java-options "-XX:-UseCompressedOops"
{code}
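
As an aside (not from this ticket), a minimal Scala sketch for confirming whether compressed oops is actually disabled in the driver JVM, using the standard HotSpot diagnostic MXBean from within the spark-shell:
{code}
// Print the effective value of UseCompressedOops for this (driver) JVM only.
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

val hotspot = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
println(hotspot.getVMOption("UseCompressedOops").getValue)  // "true" or "false"
{code}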


> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy":2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10982) Rename ExpressionAggregate -> DeclarativeAggregate

2015-10-07 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10982.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9013
[https://github.com/apache/spark/pull/9013]

> Rename ExpressionAggregate -> DeclarativeAggregate
> --
>
> Key: SPARK-10982
> URL: https://issues.apache.org/jira/browse/SPARK-10982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> Matches more closely with ImperativeAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10490) Consolidate the Cholesky solvers in WeightedLeastSquares and ALS

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10490:
--
Assignee: Yanbo Liang

> Consolidate the Cholesky solvers in WeightedLeastSquares and ALS
> 
>
> Key: SPARK-10490
> URL: https://issues.apache.org/jira/browse/SPARK-10490
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> There are two Cholesky solvers in WeightedLeastSquares and ALS, we should 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10875:
--
Shepherd: Xiangrui Meng
Target Version/s: 1.6.0

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Assignee: Nick Pritchard
>Priority: Minor
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}
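
One possible shape of the suggested fix, sketched here only as an illustration on plain Scala arrays (this is not the MLlib implementation): after filling the covariance entries, mirror the upper triangle into the lower triangle so that G(i, j) and G(j, i) are bit-for-bit identical.
{code}
// Illustration only: force exact symmetry on an n x n matrix stored as
// Array[Array[Double]] by copying the upper triangle over the lower triangle.
def symmetrize(g: Array[Array[Double]]): Unit = {
  val n = g.length
  var i = 0
  while (i < n) {
    var j = i + 1
    while (j < n) {
      g(j)(i) = g(i)(j)  // make entry (j, i) identical to entry (i, j)
      j += 1
    }
    i += 1
  }
}
{code}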



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10987) yarn-cluster mode misbehaving with netty-based RPC backend

2015-10-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947788#comment-14947788
 ] 

Marcelo Vanzin commented on SPARK-10987:


Hmm. I think I know what's going on, just not how. In client mode, the AM 
relies on the driver disconnect event (ApplicationMaster.scala, 
{{AMEndpoint::onDisconnected}}) to shut itself down. But that event does not 
seem to be arriving for some reason, so the AM sticks around and doesn't die 
until many executors are launched, fail to connect to the driver, and finally 
the maximum number of executor failures is reached.

So just need to figure out why {{onDisconnected}} is not being called here.

> yarn-cluster mode misbehaving with netty-based RPC backend
> --
>
> Key: SPARK-10987
> URL: https://issues.apache.org/jira/browse/SPARK-10987
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2015-10-07 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10985:
-

 Summary: Avoid passing evicted blocks throughout BlockManager / 
CacheManager
 Key: SPARK-10985
 URL: https://issues.apache.org/jira/browse/SPARK-10985
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager, Spark Core
Reporter: Andrew Or
Priority: Minor


This is a minor refactoring task.

Currently when we attempt to put a block in, we get back an array buffer of 
blocks that are dropped in the process. We do this to propagate these blocks 
back to our TaskContext, which will add them to its TaskMetrics so we can see 
them in the SparkUI storage tab properly.

Now that we have TaskContext.get, we can just use that to propagate this 
information. This simplifies a lot of the signatures and gets rid of weird 
return types like the following everywhere:
{code}
ArrayBuffer[(BlockId, BlockStatus)]
{code}
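
A rough sketch of the intended pattern (the names below are hypothetical stand-ins, not actual Spark internals): the code path that drops a block looks up the ambient TaskContext and records the status there, instead of returning the dropped blocks to its caller.
{code}
// Hypothetical sketch only -- DroppedBlocks stands in for "this task's TaskMetrics".
import org.apache.spark.TaskContext
import scala.collection.mutable

object DroppedBlocks {
  private val byTask = mutable.Map.empty[Long, mutable.Buffer[(String, String)]]
  def record(taskId: Long, blockId: String, status: String): Unit = synchronized {
    byTask.getOrElseUpdate(taskId, mutable.Buffer.empty) += (blockId -> status)
  }
}

def reportDroppedBlock(blockId: String, status: String): Unit = {
  val ctx = TaskContext.get()  // thread-local; null when not running inside a task
  if (ctx != null) {
    DroppedBlocks.record(ctx.taskAttemptId(), blockId, status)
  }
}
{code}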



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7869:
---
Assignee: Alexey Grishchenko

> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --
>
> Key: SPARK-7869
> URL: https://issues.apache.org/jira/browse/SPARK-7869
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1
>Reporter: Brad Willard
>Assignee: Alexey Grishchenko
>Priority: Minor
>
> Most of our tables load into dataframes just fine with Postgres. However, we 
> have a number of tables leveraging the JSONB datatype. Spark errors out and 
> refuses to load these tables. While asking Spark to support JSONB might be a 
> tall order in the short term, it would be great if Spark would at least load 
> the table, ignoring the columns it can't handle, or offer an option to do so.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type 
> at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
> at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
> at 
> org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {code}
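
A possible workaround, sketched only as an illustration (placeholder column and table names; uses the 1.4+ DataFrameReader API rather than the load call quoted above): cast the JSONB column to text inside the dbtable subquery so the JDBC source only sees types it can map.
{code}
// Workaround sketch only -- url, column and table names are placeholders.
val pdf = sqlContext.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "(select id, json_col::text as json_col from table_of_json) t")
  .load()
{code}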



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7869:
---
Target Version/s: 1.6.0

> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --
>
> Key: SPARK-7869
> URL: https://issues.apache.org/jira/browse/SPARK-7869
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1
>Reporter: Brad Willard
>Assignee: Alexey Grishchenko
>Priority: Minor
>
> Most of our tables load into dataframes just fine with Postgres. However, we 
> have a number of tables leveraging the JSONB datatype. Spark errors out and 
> refuses to load these tables. While asking Spark to support JSONB might be a 
> tall order in the short term, it would be great if Spark would at least load 
> the table, ignoring the columns it can't handle, or offer an option to do so.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type 
> at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
> at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
> at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
> at 
> org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
> at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
> at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated SPARK-10986:
--
Description: 
When running an example task on a Mesos cluster (local master, local agent), 
any Spark tasks will stall with the following error (in the executor's stderr):
{code}
15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
port 53689.
15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
/10.0.79.8:53673
java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
15/10/07 15:21:14 ERROR NettyRpcHandler: org/apache/spark/rpc/netty/AskResponse
java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 

[jira] [Created] (SPARK-10986) ClassNotFoundException when running on Client mode, with a Mesos master.

2015-10-07 Thread Joseph Wu (JIRA)
Joseph Wu created SPARK-10986:
-

 Summary: ClassNotFoundException when running on Client mode, with 
a Mesos master.
 Key: SPARK-10986
 URL: https://issues.apache.org/jira/browse/SPARK-10986
 Project: Spark
  Issue Type: Bug
  Components: Mesos
 Environment: OSX, Java 8, Mesos 0.25.0

HEAD of Spark (`f5d154bc731aedfc2eecdb4ed6af8cac820511c9`)
Built from source:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Reporter: Joseph Wu


When running an example task on a Mesos cluster (local master, local agent), 
any Spark tasks will stall with the following error:
{code}
15/10/07 15:21:14 INFO Utils: Successfully started service 'sparkExecutor' on 
port 53689.
15/10/07 15:21:14 WARN TransportChannelHandler: Exception in connection from 
/10.0.79.8:53673
java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:227)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:265)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:226)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$3$$anon$4.onSuccess(NettyRpcEnv.scala:196)
at 
org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:152)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
15/10/07 15:21:14 ERROR NettyRpcHandler: org/apache/spark/rpc/netty/AskResponse
java.lang.ClassNotFoundException: org/apache/spark/rpc/netty/AskResponse
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 

[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-10-07 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947749#comment-14947749
 ] 

Huaxin Gao commented on SPARK-8386:
---

If the above fix is correct, may I open a pull request to check in the change?
I am new to Spark and this is the very first JIRA I have looked at, so I am not 
yet familiar with the process.

I can only recreate the problem for
testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
If the flag is set to true, it works OK for me.
Also, testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, 
TABLE_NAME, connectionProperties);
works OK for me.
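
For reference, a minimal Scala sketch of that append path (illustration only; assumes url, tableName and connectionProperties are already defined):
{code}
import org.apache.spark.sql.SaveMode

// Append to an existing JDBC table via the 1.4+ writer API.
testResultsDF.write.mode(SaveMode.Append).jdbc(url, tableName, connectionProperties)
{code}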


> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends the new results found at each run to a JDBC 
> table. In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4, it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite flag to true. I also tried:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> and got the same error. The first run works, creating the new table and adding 
> data successfully, but on the second run the JDBC driver tells me that the 
> table already exists. Even SaveMode.Overwrite gives me the same error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9702) Repartition operator should use Exchange to perform its shuffle

2015-10-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9702:

Assignee: Josh Rosen

> Repartition operator should use Exchange to perform its shuffle
> ---
>
> Key: SPARK-9702
> URL: https://issues.apache.org/jira/browse/SPARK-9702
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> Spark SQL's {{Repartition}} operator is implemented in terms of Spark Core's 
> repartition operator, which means that it has to perform lots of unnecessary 
> row copying and inefficient row serialization. Instead, it would be better if 
> this was implemented using some of Exchange's internals so that it can avoid 
> row format conversions and generic getters / hashcodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10000) Consolidate cache memory management and execution memory management

2015-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947530#comment-14947530
 ] 

Reynold Xin commented on SPARK-10000:
-

[~bowenzhangusa]  thanks for the interest. This task is pretty significant and 
involves substantial refactoring to internals, so it might be pretty hard for 
somebody less familiar with Spark to just pick up. However, we will post a 
design doc soon and try to break this down into multiple tasks. Please follow 
the ticket and see if you can help contribute to some of them. Thanks!


> Consolidate cache memory management and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>
> As a Spark user, I want Spark to manage the memory more intelligently so I do 
> not need to worry about how to statically partition the execution (shuffle) 
> memory fraction and cache memory fraction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10000) Consolidate cache memory management and execution memory management

2015-10-07 Thread Bowen Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947558#comment-14947558
 ] 

Bowen Zhang commented on SPARK-10000:
-

[~rxin], sounds good.

> Consolidate cache memory management and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>
> As a Spark user, I want Spark to manage the memory more intelligently so I do 
> not need to worry about how to statically partition the execution (shuffle) 
> memory fraction and cache memory fraction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10987) yarn-cluster mode misbehaving with netty-based RPC backend

2015-10-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947738#comment-14947738
 ] 

Marcelo Vanzin commented on SPARK-10987:


It may not be cluster mode per se; I ran the tests locally and the cluster mode 
test seems to not be running at all, because there's still a container from the 
previous test running, stuck at this place:

{noformat}
"main" prio=10 tid=0x7f1554017800 nid=0x3d2d waiting on condition 
[0x7f155d326000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xf9893fd0> (a 
scala.concurrent.impl.Promise$CompletionLatch)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:99)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:162)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:149)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:250)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
{noformat}
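
For illustration, the parked frame is the standard Scala {{Await.result}} pattern that 
{{RpcTimeout.awaitResult}} goes through in the stack above; a minimal sketch (not the actual 
CoarseGrainedExecutorBackend code) that produces the same TIMED_WAITING state:

{code}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// A promise that is never completed stands in for the missing driver reply.
val reply = Promise[String]()

// Await.result parks the calling thread until the future completes or the
// timeout expires, then throws a TimeoutException.
val value = Await.result(reply.future, 120.seconds)
{code}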


> yarn-cluster mode misbehaving with netty-based RPC backend
> --
>
> Key: SPARK-10987
> URL: https://issues.apache.org/jira/browse/SPARK-10987
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10000) Consolidate storage and execution memory management

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10000:
--
Summary: Consolidate storage and execution memory management  (was: 
Consolidate cache memory management and execution memory management)

> Consolidate storage and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
> Attachments: unified-memory-management-spark-10000.pdf
>
>
> Memory management in Spark is currently broken down into two disjoint 
> regions: one for execution and one for storage. The sizes of these regions 
> are statically configured and fixed for the duration of the application.
> There are several limitations to this approach. It requires user expertise to 
> avoid unnecessary spilling, and there are no sensible defaults that will work 
> for all workloads. As a Spark user, I want Spark to manage the memory more 
> intelligently so I do not need to worry about how to statically partition the 
> execution (shuffle) memory fraction and cache memory fraction. More 
> importantly, applications that do not use caching use only a small fraction 
> of the heap space, resulting in suboptimal performance.
> Instead, we should unify these two regions and let one borrow from another if 
> possible.
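
For context, the static split described above is what the legacy fraction settings control; 
a minimal sketch of how that partitioning is configured today (the values shown are the 
documented 1.x defaults, worth re-checking against the docs for your exact version):

{code}
import org.apache.spark.SparkConf

// Static regions in Spark 1.x:
//   execution (shuffle) region: spark.shuffle.memoryFraction (default 0.2)
//   storage (cache) region:     spark.storage.memoryFraction (default 0.6)
val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.2")
  .set("spark.storage.memoryFraction", "0.6")
{code}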



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10980) Create wrong decimal if unscaled > 1e18 and scale > 0

2015-10-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10980:
---
Description: Decimal(100L, 20, 2) will become 
100 instead of 1.00  (was: Decimal(1, 20, 5) 
will become 1 instead of 0.1)

> Create wrong decimal if unscaled > 1e18 and scale > 0
> -
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Decimal(100L, 20, 2) will become 100 instead 
> of 1.00



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10779) Set initialModel for KMeans model in PySpark (spark.mllib)

2015-10-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10779.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8967
[https://github.com/apache/spark/pull/8967]

> Set initialModel for KMeans model in PySpark (spark.mllib)
> --
>
> Key: SPARK-10779
> URL: https://issues.apache.org/jira/browse/SPARK-10779
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
> Fix For: 1.6.0
>
>
> Provide initialModel param for pyspark.mllib.clustering.KMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10987) yarn-cluster mode misbehaving with netty-based RPC backend

2015-10-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947743#comment-14947743
 ] 

Marcelo Vanzin commented on SPARK-10987:


The {{ExecutorRunnable}} process (client-mode AM) was also running at the time, 
although I didn't notice anything interesting in the stack trace.

> yarn-cluster mode misbehaving with netty-based RPC backend
> --
>
> Key: SPARK-10987
> URL: https://issues.apache.org/jira/browse/SPARK-10987
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10738) Refactoring `Instance` out from LOR and LIR, and also cleaning up some code

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10738.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8853
[https://github.com/apache/spark/pull/8853]

> Refactoring `Instance` out from LOR and LIR, and also cleaning up some code
> ---
>
> Key: SPARK-10738
> URL: https://issues.apache.org/jira/browse/SPARK-10738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
> Fix For: 1.6.0
>
>
> Refactoring `Instance` case class out from LOR and LIR, and also cleaning up 
> some code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10917) Improve performance of complex types in columnar cache

2015-10-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10917.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8971
[https://github.com/apache/spark/pull/8971]

> Improve performance of complex types in columnar cache
> --
>
> Key: SPARK-10917
> URL: https://issues.apache.org/jira/browse/SPARK-10917
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> Complex types are really really slow in columnar cache, because of kryo 
> serializer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10987) yarn-client mode misbehaving with netty-based RPC backend

2015-10-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-10987:
---
Summary: yarn-client mode misbehaving with netty-based RPC backend  (was: 
yarn-cluster mode misbehaving with netty-based RPC backend)

> yarn-client mode misbehaving with netty-based RPC backend
> -
>
> Key: SPARK-10987
> URL: https://issues.apache.org/jira/browse/SPARK-10987
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10980) Create wrong decimal if unscaled > 1e18 and scale > 0

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10980:


Assignee: Davies Liu  (was: Apache Spark)

> Create wrong decimal if unscaled > 1e18 and scale > 0
> -
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Decimal(100L, 20, 2) will become 100 instead 
> of 1.00



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10980) Create wrong decimal if unscaled > 1e18 and scale > 0

2015-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947534#comment-14947534
 ] 

Apache Spark commented on SPARK-10980:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9014

> Create wrong decimal if unscaled > 1e18 and scale > 0
> -
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Decimal(100L, 20, 2) will become 100 instead 
> of 1.00



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted

2015-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947533#comment-14947533
 ] 

Sean Owen commented on SPARK-10942:
---

I tried this on master in spark-shell:

{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val ssc = new StreamingContext(sc, Seconds(1))

val inputRDDs = mutable.Queue.tabulate(30) { i =>
  sc.parallelize(Seq(i))
}

val input = ssc.queueStream(inputRDDs)

val output = input.transform { rdd =>
  if (rdd.isEmpty()) {
rdd
  } else {
val rdd2 = rdd.map(identity)
rdd2.cache()
rdd2.setName(rdd.first().toString)
val rdd3 = rdd2.map(identity) ++ rdd2.map(identity)
rdd3
  }
}

output.print()
ssc.start()
{code}

I see nothing in the Storage tab after a short time, like ~30 seconds. The RDDs 
were cached but are then unpersisted.

> Not all cached RDDs are unpersisted
> ---
>
> Key: SPARK-10942
> URL: https://issues.apache.org/jira/browse/SPARK-10942
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Nick Pritchard
>Priority: Minor
> Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png
>
>
> I have a Spark Streaming application that caches RDDs inside of a 
> {{transform}} closure. Looking at the Spark UI, it seems that most of these 
> RDDs are unpersisted after the batch completes, but not all.
> I have copied a minimal reproducible example below to highlight the problem. 
> I run this and monitor the Spark UI "Storage" tab. The example generates and 
> caches 30 RDDs, and I see most get cleaned up. However in the end, some still 
> remain cached. There is some randomness going on because I see different RDDs 
> remain cached for each run.
> I have marked this as Major because I haven't been able to workaround it and 
> it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} 
> but that did not change anything.
> {code}
> val inputRDDs = mutable.Queue.tabulate(30) { i =>
>   sc.parallelize(Seq(i))
> }
> val input: DStream[Int] = ssc.queueStream(inputRDDs)
> val output = input.transform { rdd =>
>   if (rdd.isEmpty()) {
> rdd
>   } else {
> val rdd2 = rdd.map(identity)
> rdd2.setName(rdd.first().toString)
> rdd2.cache()
> val rdd3 = rdd2.map(identity)
> rdd3
>   }
> }
> output.print()
> ssc.start()
> ssc.awaitTermination()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10980) Create wrong decimal if unscaled > 1e18 and scale > 0

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10980:


Assignee: Apache Spark  (was: Davies Liu)

> Create wrong decimal if unscaled > 1e18 and scale > 0
> -
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Decimal(100L, 20, 2) will become 100 instead 
> of 1.00



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10300) Use tags to control which tests to run depending on changes being tested

2015-10-07 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10300.

Resolution: Fixed

Second time is the charm?

> Use tags to control which tests to run depending on changes being tested
> 
>
> Key: SPARK-10300
> URL: https://issues.apache.org/jira/browse/SPARK-10300
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> Our unit tests are a little slow, and we could benefit from finer-grained 
> control over which test suites to run depending on what parts of the code 
> base is changed.
> Currently we already have some logic in "run-tests.py" to do this, but it's 
> limited; for example, a minor change in an untracked module is mapped to a 
> "root" module change, and causes really expensive Hive compatibility tests to 
> run when that may not really be necessary.
> Using tags could allow us to be smarter here; this is an idea that has been 
> thrown around before (e.g. SPARK-4746). On top of that, for the cases when we 
> actually do need to run all the tests, we should bump the existing timeout.
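
As a sketch of what tag-based selection could look like with ScalaTest (illustrative only; 
the tag name below is hypothetical, not a tag this issue actually introduced):

{code}
import org.scalatest.{FunSuite, Tag}

// Hypothetical tag marking tests that need a Hive installation.
object HiveTest extends Tag("org.apache.spark.tags.HiveTest")

class ExampleSuite extends FunSuite {
  test("runs everywhere") {
    assert(1 + 1 == 2)
  }
  test("needs Hive", HiveTest) {
    // expensive Hive-dependent assertions would go here
  }
}
{code}

The build could then include or exclude tagged suites, e.g. by passing {{-l 
org.apache.spark.tags.HiveTest}} to the ScalaTest runner to skip the expensive ones.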



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10000) Consolidate cache memory management and execution memory management

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10000:
--
Description: 
Memory management in Spark is currently broken down into two disjoint regions: 
one for execution and one for storage. The sizes of these regions are 
statically configured and fixed for the duration of the application.

There are several limitations to this approach. It requires user expertise to 
avoid unnecessary spilling, and there are no sensible defaults that will work 
for all workloads. As a Spark user, I want Spark to manage the memory more 
intelligently so I do not need to worry about how to statically partition the 
execution (shuffle) memory fraction and cache memory fraction. Most 
importantly, applications that do not use caching use only a small fraction of 
the heap space, resulting in suboptimal performance.



  was:
As a Spark user, I want Spark to manage the memory more intelligently so I do 
not need to worry about how to statically partition the execution (shuffle) 
memory fraction and cache memory fraction.



> Consolidate cache memory management and execution memory management
> ---
>
> Key: SPARK-10000
> URL: https://issues.apache.org/jira/browse/SPARK-10000
> Project: Spark
>  Issue Type: Story
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>
> Memory management in Spark is currently broken down into two disjoint 
> regions: one for execution and one for storage. The sizes of these regions 
> are statically configured and fixed for the duration of the application.
> There are several limitations to this approach. It requires user expertise to 
> avoid unnecessary spilling, and there are no sensible defaults that will work 
> for all workloads. As a Spark user, I want Spark to manage the memory more 
> intelligently so I do not need to worry about how to statically partition the 
> execution (shuffle) memory fraction and cache memory fraction. Most 
> importantly, applications that do not use caching use only a small fraction 
> of the heap space, resulting in suboptimal performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945358#comment-14945358
 ] 

Reynold Xin edited comment on SPARK-10914 at 10/7/15 10:22 PM:
---

Thanks for looking into it.  I have narrowed it down a lot now.  It depends on 
the --executor-memory setting!

For me using "bin/spark-shell" locally I don't see the problem, but I do see it 
when I use a standalone cluster. It reliably reproduces whenever I specify 
"--executor-memory 32g" or greater, but if I leave executor-memory unset or 
specify a value of 31g or less I get the correct result.

Here's a correct run:
{code}
spark@spark-master:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-master:7077   --executor-memory 31g 

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
res0: Long = 2 
{code}

Here's an incorrect run, with an explain plan (the explain is the same in any 
case):
{code}
spark@spark-master:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-master:7077   --executor-memory 32g

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").explain()
== Physical Plan ==
BroadcastHashJoin [xx#0], [yy#2], BuildRight
 Union
  TungstenProject [1 AS xx#0]
   Scan OneRowRelation[]
  TungstenProject [2 AS _c0#1]
   Scan OneRowRelation[]
 Union
  TungstenProject [1 AS yy#2]
   Scan OneRowRelation[]
  TungstenProject [2 AS _c0#3]
   Scan OneRowRelation[]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
res1: Long = 0   
{code}

I have two machines in my cluster:
- one is Ubuntu 12.04, running the spark-master node
- one is Ubuntu 14.04, running the spark-slave node.

Both have 256 GB of RAM.

JVM on both machines is: oracle-java7-installer from PPA:
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)




was (Author: benm):
Thanks for looking into it.  I have narrowed it down a lot now.  It depends on 
the --executor-memory setting!

For me using "bin/spark-shell" locally I don't see the problem, but I do see it 
when I use a standalone cluster. It reliably reproduces whenever I specify 
"--executor-memory 32g" or greater, but if I leave executor-memory unset or 
specify a value of 31g or less I get the correct result.

Here's a correct run:
spark@spark-master:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-master:7077   --executor-memory 31g 

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
res0: Long = 2 


Here's an incorrect run, with an explain plan (the explain is the same in any 
case):

spark@spark-master:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master 
spark://spark-master:7077   --executor-memory 32g

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").explain()
== Physical Plan ==
BroadcastHashJoin [xx#0], [yy#2], BuildRight
 Union
  TungstenProject [1 AS xx#0]
   Scan OneRowRelation[]
  TungstenProject [2 AS _c0#1]
   Scan OneRowRelation[]
 Union
  TungstenProject [1 AS yy#2]
   Scan OneRowRelation[]
  TungstenProject [2 AS _c0#3]
   Scan OneRowRelation[]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
res1: Long = 0   


I have two machines in my cluster:
- one is Ubuntu 12.04, running the spark-master node
- one is Ubuntu 14.04, running the spark-slave node.

Both have 256 GB of RAM.

JVM on both machines is: oracle-java7-installer from PPA:
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)



> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.

[jira] [Resolved] (SPARK-10490) Consolidate the Cholesky solvers in WeightedLeastSquares and ALS

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10490.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8936
[https://github.com/apache/spark/pull/8936]

> Consolidate the Cholesky solvers in WeightedLeastSquares and ALS
> 
>
> Key: SPARK-10490
> URL: https://issues.apache.org/jira/browse/SPARK-10490
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
> Fix For: 1.6.0
>
>
> There are two Cholesky solvers in WeightedLeastSquares and ALS, we should 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10980) Create wrong decimal with unscaled value and precision > 18

2015-10-07 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10980:
--

 Summary: Create wrong decimal with unscaled value and precision > 
18
 Key: SPARK-10980
 URL: https://issues.apache.org/jira/browse/SPARK-10980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.6.0
Reporter: Davies Liu
Assignee: Davies Liu


Decimal(1, 20, 5) will become 1 instead of 0.1
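(The unscaled values appear truncated in this archive.) For readers unfamiliar with the 
terms, "unscaled" and "scale" follow the usual BigDecimal convention: the decimal value is 
unscaled * 10^(-scale). A small sketch with a hypothetical unscaled value above 1e18:

{code}
import java.math.BigDecimal

// Hypothetical unscaled value > 1e18, with scale = 2.
val unscaled = 2000000000000000000L
val expected = BigDecimal.valueOf(unscaled, 2)  // 20000000000000000.00
println(expected)
{code}

The bug reported here is that Spark's Decimal does not produce this expected value when the 
unscaled value is above 1e18 and the scale is positive.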



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10980) Create wrong decimal with unscaled value and precision > 18

2015-10-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10980:
---
Affects Version/s: 1.2.2
   1.3.1
   1.4.1

> Create wrong decimal with unscaled value and precision > 18
> ---
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Decimal(1, 20, 5) will become 1 instead of 0.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10980) Create wrong decimal if unscaled > 1e18 and scale > 0

2015-10-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-10980:
---
Summary: Create wrong decimal if unscaled > 1e18 and scale > 0  (was: 
Create wrong decimal with unscaled value and precision > 18)

> Create wrong decimal if unscaled > 1e18 and scale > 0
> -
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Decimal(1, 20, 5) will become 1 instead of 0.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10342) Cooperative memory management

2015-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947687#comment-14947687
 ] 

Reynold Xin commented on SPARK-10342:
-

[~fxing] thanks a lot for the interest. Since this is your first time 
contributing to Spark, it'd be better to start with some simpler tasks. This 
task itself is fairly complicated and would require deep understanding of the 
internals to do. I'd recommend searching for some starter tasks and bug fixes 
to get yourself warmed up first, and then move towards the more challenging 
ones.


> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time, and they have become worse in 
> 1.5 since we use larger pages.
> In order to increase memory usage (reduce unnecessary spilling) and also reduce the risk 
> of OOM, we should manage memory in a cooperative way: every memory consumer should also 
> be able to release memory (by spilling) upon others' requests.
> Memory requests can differ: a hard requirement (the task will crash if the memory is not 
> allocated) versus a soft requirement (performance suffers if it is not allocated). The 
> costs of spilling also differ between consumers. We could introduce some kind of 
> priority to make them work together better.
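
A minimal sketch of what such a cooperative contract could look like (purely illustrative; 
the trait and class names here are assumptions, not Spark's actual memory-management API):

{code}
// Illustrative only: consumers promise to release memory (by spilling) on request.
trait SpillableConsumer {
  /** Free up to `numBytes` bytes, e.g. by spilling to disk; returns bytes actually freed. */
  def spill(numBytes: Long): Long
}

class CooperativePool(total: Long) {
  private var used = 0L
  private val consumers = scala.collection.mutable.ArrayBuffer.empty[SpillableConsumer]

  def register(c: SpillableConsumer): Unit = synchronized { consumers += c }

  /** Try to grant `numBytes`, asking other consumers to spill if the pool is short. */
  def acquire(numBytes: Long, requester: SpillableConsumer): Long = synchronized {
    def free: Long = total - used
    val others = consumers.filterNot(_ eq requester).iterator
    while (free < numBytes && others.hasNext) {
      // Spilled bytes are returned to the pool (consumers must not over-report).
      used -= others.next().spill(numBytes - free)
    }
    val granted = math.min(numBytes, free)
    used += granted
    granted
  }

  def release(numBytes: Long): Unit = synchronized { used -= numBytes }
}
{code}

The hard-vs-soft requirement and spill-cost ideas above would map to how such a pool decides 
which consumers to ask to spill first.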



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-10-07 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947736#comment-14947736
 ] 

Huaxin Gao commented on SPARK-8386:
---

I looked at the code; it has this:
{code}
  @deprecated("Use write.jdbc()", "1.4.0")
  def insertIntoJDBC(url: String, table: String, overwrite: Boolean): Unit = {
    val w = if (overwrite) write.mode(SaveMode.Overwrite) else write
    w.jdbc(url, table, new Properties)
  }
{code}
If {{overwrite}} is false, the mode is not set and the default is SaveMode.ErrorIfExists.
It seems to me that if {{overwrite}} is false, the mode should be set to Append:
{code}
  def insertIntoJDBC(url: String, table: String, overwrite: Boolean): Unit = {
    val w = if (overwrite) write.mode(SaveMode.Overwrite) else write.mode(SaveMode.Append)
    w.jdbc(url, table, new Properties)
  }
{code}



> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends to a JDBC table new results found at each run. 
>  In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite to true.  I also tried this 
> now:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> getting the same error. It works running the first time creating the new 
> table and adding data successfully. But, running it a second time it (the 
> jdbc driver) will tell me that the table already exists. Even 
> SaveMode.Overwrite will give me the same error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10959) PySpark StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10959:
--
Assignee: Bryan Cutler

> PySpark StreamingLogisticRegressionWithSGD does not train with given regParam 
> and convergenceTol parameters
> ---
>
> Key: SPARK-10959
> URL: https://issues.apache.org/jira/browse/SPARK-10959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Critical
>
> These parameters are passed into the StreamingLogisticRegressionWithSGD 
> constructor, but do not get transferred to the model to use when training.  
> Same problem with StreamingLinearRegressionWithSGD and the intercept param is 
> in the wrong  argument place where it is being used as regularization value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10980) Create wrong decimal if unscaled > 1e18 and scale > 0

2015-10-07 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-10980.

   Resolution: Fixed
Fix Version/s: 1.2.3
   1.5.2
   1.3.2
   1.4.2
   1.6.0

Issue resolved by pull request 9014
[https://github.com/apache/spark/pull/9014]

> Create wrong decimal if unscaled > 1e18 and scale > 0
> -
>
> Key: SPARK-10980
> URL: https://issues.apache.org/jira/browse/SPARK-10980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0, 1.4.2, 1.3.2, 1.5.2, 1.2.3
>
>
> Decimal(100L, 20, 2) will become 100 instead 
> of 1.00



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10982) Rename ExpressionAggregate -> DeclarativeAggregate

2015-10-07 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-10982:
---

 Summary: Rename ExpressionAggregate -> DeclarativeAggregate
 Key: SPARK-10982
 URL: https://issues.apache.org/jira/browse/SPARK-10982
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Matches more closely with ImperativeAggregate.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10956) Introduce common memory management interface for execution and storage

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10956:
--
Description: 
The first step towards implementing a solution for SPARK-10000 is to refactor 
the existing code to go through a common MemoryManager interface. *This issue 
is concerned only with the introduction of this interface, preserving the 
existing behavior as much as possible.* In the near future, we will implement 
an alternate MemoryManager that shares memory between storage and execution 
more efficiently.

For a high level design doc, see SPARK-10000.

  was:
The first step towards implementing a solution for SPARK-10000 is to refactor 
the existing code to go through a common MemoryManager interface. *This issue 
is concerned only with the introduction of this interface, preserving the 
existing behavior as much as possible.* In the near future, we will implement 
an alternate MemoryManager that shares memory between storage and execution 
more efficiently.

A higher level design doc will be posted shortly on an upcoming issue.


> Introduce common memory management interface for execution and storage
> --
>
> Key: SPARK-10956
> URL: https://issues.apache.org/jira/browse/SPARK-10956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> The first step towards implementing a solution for SPARK-10000 is to refactor 
> the existing code to go through a common MemoryManager interface. *This issue 
> is concerned only with the introduction of this interface, preserving the 
> existing behavior as much as possible.* In the near future, we will implement 
> an alternate MemoryManager that shares memory between storage and execution 
> more efficiently.
> For a high level design doc, see SPARK-10000.
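
For readers following along, a rough sketch of the kind of interface being introduced (the 
method names and shapes below are illustrative guesses, not the signatures that were 
actually merged; the class names come from the related issues in this thread):

{code}
// Illustrative only.
abstract class MemoryManager(maxMemory: Long) {
  /** Reserve execution (shuffle/join/sort) memory; returns the number of bytes granted. */
  def acquireExecutionMemory(numBytes: Long): Long
  /** Reserve storage (block cache) memory; returns true if the request fits. */
  def acquireStorageMemory(numBytes: Long): Boolean
  def releaseExecutionMemory(numBytes: Long): Unit
  def releaseStorageMemory(numBytes: Long): Unit
}

/** Preserves today's behavior: two fixed, statically sized regions. */
class StaticMemoryManager(executionMax: Long, storageMax: Long)
  extends MemoryManager(executionMax + storageMax) {

  private var executionUsed = 0L
  private var storageUsed = 0L

  override def acquireExecutionMemory(numBytes: Long): Long = synchronized {
    val granted = math.min(numBytes, executionMax - executionUsed)
    executionUsed += granted
    granted
  }
  override def acquireStorageMemory(numBytes: Long): Boolean = synchronized {
    val fits = storageUsed + numBytes <= storageMax
    if (fits) storageUsed += numBytes
    fits
  }
  override def releaseExecutionMemory(numBytes: Long): Unit = synchronized {
    executionUsed = math.max(0L, executionUsed - numBytes)
  }
  override def releaseStorageMemory(numBytes: Long): Unit = synchronized {
    storageUsed = math.max(0L, storageUsed - numBytes)
  }
}
{code}

A future unified implementation would differ mainly in letting one region borrow unused 
space from the other.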



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10956) Introduce common memory management interface for execution and storage

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10956:
--
Description: 
The first step towards implementing a solution for SPARK-10000 is to refactor 
the existing code to go through a common MemoryManager interface. *This issue 
is concerned only with the introduction of this interface, preserving the 
existing behavior as much as possible.* In the near future, we will implement 
an alternate MemoryManager that shares memory between storage and execution 
more efficiently.

A higher level design doc will be posted shortly on an upcoming issue.

  was:
Memory management in Spark is currently broken down into two disjoint regions: 
one for execution and one for storage. The sizes of these regions are 
statically configured and fixed for the duration of the application.

There are several limitations to this approach. It requires user expertise to 
avoid unnecessary spilling, and there are no sensible defaults that will work 
for all workloads. Most importantly, applications that do not use caching use 
only a small fraction of the heap space, resulting in suboptimal performance.

The first step towards implementing a solution is to refactor the existing code 
to go through a common MemoryManager interface. *This issue is concerned only 
with the introduction of this interface, preserving the existing behavior as 
much as possible.* In the near future, we will implement an alternate 
MemoryManager that shares memory between storage and execution more efficiently.

A higher level design doc will be posted shortly on an upcoming issue.


> Introduce common memory management interface for execution and storage
> --
>
> Key: SPARK-10956
> URL: https://issues.apache.org/jira/browse/SPARK-10956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> The first step towards implementing a solution for SPARK-10000 is to refactor 
> the existing code to go through a common MemoryManager interface. *This issue 
> is concerned only with the introduction of this interface, preserving the 
> existing behavior as much as possible.* In the near future, we will implement 
> an alternate MemoryManager that shares memory between storage and execution 
> more efficiently.
> A higher level design doc will be posted shortly on an upcoming issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10984) Simplify *MemoryManager class structure

2015-10-07 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10984:
-

 Summary: Simplify *MemoryManager class structure
 Key: SPARK-10984
 URL: https://issues.apache.org/jira/browse/SPARK-10984
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or


This is a refactoring task.

After SPARK-10956 gets merged, we will have the following:
- MemoryManager
- StaticMemoryManager
- ExecutorMemoryManager
- TaskMemoryManager
- ShuffleMemoryManager

This is pretty confusing. The goal is to merge ShuffleMemoryManager and 
ExecutorMemoryManager and move them into the top-level MemoryManager abstract 
class. Then TaskMemoryManager should be renamed to something else and used by 
MemoryManager, such that the new hierarchy becomes (sketched after the list below):

- MemoryManager
- StaticMemoryManager
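
Expressed as stubs, the end state described above would look roughly like this (the class 
names are the ones listed in this issue; their shapes are illustrative guesses):

{code}
// Top-level abstraction that absorbs ExecutorMemoryManager and ShuffleMemoryManager.
abstract class MemoryManager

// The only concrete implementation for now, keeping today's fixed split.
class StaticMemoryManager extends MemoryManager

// Per-task bookkeeping (to be renamed), owned and used by MemoryManager
// rather than standing alone in the hierarchy.
class TaskMemoryManager(manager: MemoryManager)
{code}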



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10956) Introduce common memory management interface for execution and storage

2015-10-07 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10956:
--
Priority: Major  (was: Critical)

> Introduce common memory management interface for execution and storage
> --
>
> Key: SPARK-10956
> URL: https://issues.apache.org/jira/browse/SPARK-10956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> The first step towards implementing a solution for SPARK-10000 is to refactor 
> the existing code to go through a common MemoryManager interface. *This issue 
> is concerned only with the introduction of this interface, preserving the 
> existing behavior as much as possible.* In the near future, we will implement 
> an alternate MemoryManager that shares memory between storage and execution 
> more efficiently.
> For a high level design doc, see SPARK-10000.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10779) Set initialModel for KMeans model in PySpark (spark.mllib)

2015-10-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10779:
--
Assignee: Evan Chen

> Set initialModel for KMeans model in PySpark (spark.mllib)
> --
>
> Key: SPARK-10779
> URL: https://issues.apache.org/jira/browse/SPARK-10779
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Evan Chen
> Fix For: 1.6.0
>
>
> Provide initialModel param for pyspark.mllib.clustering.KMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14945441#comment-14945441
 ] 

Reynold Xin edited comment on SPARK-10914 at 10/7/15 10:12 PM:
---

I just ran with 
{code}
--executor-memory 100g --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
{code}

but the problem persists.  In the worker log it shows:

{code}
15/10/06 18:36:36 INFO ExecutorRunner: Launch command: 
"/usr/lib/jvm/java-7-oracle/jre/bin/java" "-cp" 
"/home/spark/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar"
 "-Xms102400M" "-Xmx102400M" "-Dspark.driver.port=53169" 
"-XX:-UseCompressedOops" "-XX:MaxPermSize=256m" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"akka.tcp://sparkDriver@10.122.82.99:53169/user/CoarseGrainedScheduler" 
"--executor-id" "0" "--hostname" "10.122.82.99" "--cores" "20" "--app-id" 
"app-20151006183636-0019" "--worker-url" 
"akka.tcp://sparkWorker@10.122.82.99:51402/user/Worker"
{code}


was (Author: benm):
I just ran with 
--executor-memory 100g --conf 
"spark.executor.extraJavaOptions=-XX:-UseCompressedOops"

but the problem persists.  In the worker log it shows:


15/10/06 18:36:36 INFO ExecutorRunner: Launch command: 
"/usr/lib/jvm/java-7-oracle/jre/bin/java" "-cp" 
"/home/spark/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/home/spark/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar"
 "-Xms102400M" "-Xmx102400M" "-Dspark.driver.port=53169" 
"-XX:-UseCompressedOops" "-XX:MaxPermSize=256m" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"akka.tcp://sparkDriver@10.122.82.99:53169/user/CoarseGrainedScheduler" 
"--executor-id" "0" "--hostname" "10.122.82.99" "--cores" "20" "--app-id" 
"app-20151006183636-0019" "--worker-url" 
"akka.tcp://sparkWorker@10.122.82.99:51402/user/Worker"


> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10914:

Description: 
Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
{code}

  was:
Using an inner join, to match together two integer columns, I generally get no 
results when there should be matches.  But the results vary and depend on 
whether the dataframes are coming from SQL, JSON, or cached, as well as the 
order in which I cache things and query them.

This minimal example reproduces it consistently for me in the spark-shell, on 
new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
http://spark.apache.org/downloads.html.)

{code}
/* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
val x = sql("select 1 xx union all select 2") 
val y = sql("select 1 yy union all select 2")

x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */

/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */

{code}


> Incorrect empty join sets when executor-memory >= 32g
> -
>
> Key: SPARK-10914
> URL: https://issues.apache.org/jira/browse/SPARK-10914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
>Reporter: Ben Moran
>
> Using an inner join, to match together two integer columns, I generally get 
> no results when there should be matches.  But the results vary and depend on 
> whether the dataframes are coming from SQL, JSON, or cached, as well as the 
> order in which I cache things and query them.
> This minimal example reproduces it consistently for me in the spark-shell, on 
> new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from 
> http://spark.apache.org/downloads.html.)
> {code}
> /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */
> val x = sql("select 1 xx union all select 2") 
> val y = sql("select 1 yy union all select 2")
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
> /* If I cache both tables it works: */
> x.cache()
> y.cache()
> x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
> /* but this still doesn't work: */
> x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10987) yarn-cluster mode misbehaving with netty-based RPC backend

2015-10-07 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-10987:
--

 Summary: yarn-cluster mode misbehaving with netty-based RPC backend
 Key: SPARK-10987
 URL: https://issues.apache.org/jira/browse/SPARK-10987
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin
Priority: Blocker


YARN running in cluster deploy mode seems to be having issues with the new RPC 
backend; if you look at unit test runs, tests that run in cluster mode are 
taking several minutes to run, instead of the more usual 20-30 seconds.

For example, 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:

{noformat}
[info] YarnClusterSuite:
[info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
[info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
[info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
[info] - run Python application in yarn-client mode (21 seconds, 842 
milliseconds)
[info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
[info] - user class path first in client mode (1 minute, 58 seconds)
[info] - user class path first in cluster mode (4 minutes, 49 seconds)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10919) Association rules class should return the support of each rule

2015-10-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10919:
--
Summary: Association rules class should return the support of each rule  
(was: Assosiation rules class should return the support of each rule)

> Association rules class should return the support of each rule
> --
>
> Key: SPARK-10919
> URL: https://issues.apache.org/jira/browse/SPARK-10919
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tofigh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current implementation of association rules does not return the frequency of 
> appearance (support) of each rule. This piece of information is essential for 
> implementing functional dependency on top of the AR. In order to return the support of 
> each rule, {{freqUnion: Double}} and {{freqAntecedent: Double}} should be declared as 
> {{val freqUnion: Double, val freqAntecedent: Double}}.
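
To make the ask concrete, with the frequencies exposed as vals a caller could recover both 
the confidence and (given the transaction count) the support. A hedged sketch, not the 
actual {{AssociationRules.Rule}} definition:

{code}
// Illustrative only: how exposed frequencies would be used downstream.
case class Rule(
    antecedent: Array[String],
    consequent: Array[String],
    freqUnion: Double,
    freqAntecedent: Double) {

  def confidence: Double = freqUnion / freqAntecedent

  /** Support of the full itemset, given the total number of transactions. */
  def support(numTransactions: Long): Double = freqUnion / numTransactions
}
{code}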



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10919) Assosiation rules class should return the support of each rule

2015-10-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947720#comment-14947720
 ] 

Joseph K. Bradley commented on SPARK-10919:
---

Would you be interested in sending a PR for this?

> Assosiation rules class should return the support of each rule
> --
>
> Key: SPARK-10919
> URL: https://issues.apache.org/jira/browse/SPARK-10919
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tofigh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current implementation of association rules does not return the frequency of 
> appearance (support) of each rule. This piece of information is essential for 
> implementing functional dependency on top of the AR. In order to return the support of 
> each rule, {{freqUnion: Double}} and {{freqAntecedent: Double}} should be declared as 
> {{val freqUnion: Double, val freqAntecedent: Double}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9702) Repartition operator should use Exchange to perform its shuffle

2015-10-07 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-9702.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8083
[https://github.com/apache/spark/pull/8083]

> Repartition operator should use Exchange to perform its shuffle
> ---
>
> Key: SPARK-9702
> URL: https://issues.apache.org/jira/browse/SPARK-9702
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
> Fix For: 1.6.0
>
>
> Spark SQL's {{Repartition}} operator is implemented in terms of Spark Core's 
> repartition operator, which means that it has to perform lots of unnecessary 
> row copying and inefficient row serialization. Instead, it would be better if 
> this was implemented using some of Exchange's internals so that it can avoid 
> row format conversions and generic getters / hashcodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10875:
--
Assignee: Nick Pritchard

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Assignee: Nick Pritchard
>Priority: Minor
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}
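Until the computation itself is fixed, one possible caller-side workaround (a sketch,
assuming that averaging G(i, j) and G(j, i) is acceptable for the use case) is to
symmetrize the returned matrix before constructing the Gaussian:

{code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Force exact symmetry by averaging each (i, j) / (j, i) pair.
// Matrices.dense expects column-major values, hence the j * n + i index.
def symmetrize(cov: Matrix): Matrix = {
  val n = cov.numRows
  val values = Array.ofDim[Double](n * n)
  for (i <- 0 until n; j <- 0 until n) {
    values(j * n + i) = (cov(i, j) + cov(j, i)) / 2.0
  }
  Matrices.dense(n, n, values)
}

// val dist = new MultivariateGaussian(mean, symmetrize(cov))
{code}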



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10982) Rename ExpressionAggregate -> DeclarativeAggregate

2015-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947526#comment-14947526
 ] 

Apache Spark commented on SPARK-10982:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9013

> Rename ExpressionAggregate -> DeclarativeAggregate
> --
>
> Key: SPARK-10982
> URL: https://issues.apache.org/jira/browse/SPARK-10982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Matches more closely with ImperativeAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10982) Rename ExpressionAggregate -> DeclarativeAggregate

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10982:


Assignee: Reynold Xin  (was: Apache Spark)

> Rename ExpressionAggregate -> DeclarativeAggregate
> --
>
> Key: SPARK-10982
> URL: https://issues.apache.org/jira/browse/SPARK-10982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Matches more closely with ImperativeAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10982) Rename ExpressionAggregate -> DeclarativeAggregate

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10982:


Assignee: Apache Spark  (was: Reynold Xin)

> Rename ExpressionAggregate -> DeclarativeAggregate
> --
>
> Key: SPARK-10982
> URL: https://issues.apache.org/jira/browse/SPARK-10982
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Matches more closely with ImperativeAggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10983) Implement unified memory manager

2015-10-07 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10983:
-

 Summary: Implement unified memory manager
 Key: SPARK-10983
 URL: https://issues.apache.org/jira/browse/SPARK-10983
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Andrew Or
Priority: Critical


This builds on top of the MemoryManager interface introduced in SPARK-10956. 
That issue added a StaticMemoryManager that preserves the legacy behavior. 
This issue is concerned with implementing a UnifiedMemoryManager (or whatever 
we call it) according to the design doc posted in SPARK-1.

Note: the scope of this issue is limited to implementing this new mode without 
significant refactoring. If necessary, any such refactoring should come later 
(or earlier) in a separate issue.
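For illustration only, a toy sketch of the "unified" idea (a single pool shared by
execution and storage, rather than the legacy fixed split); none of these names or
signatures are the actual MemoryManager API:

{code}
// Toy model: execution and storage draw from one shared pool.
class UnifiedPoolSketch(totalBytes: Long) {
  private var executionUsed = 0L
  private var storageUsed = 0L

  def acquireExecution(bytes: Long): Boolean = synchronized {
    val free = totalBytes - executionUsed - storageUsed
    // A real implementation would evict cached blocks here to reclaim storage memory.
    if (bytes <= free) { executionUsed += bytes; true } else false
  }

  def acquireStorage(bytes: Long): Boolean = synchronized {
    val free = totalBytes - executionUsed - storageUsed
    if (bytes <= free) { storageUsed += bytes; true } else false
  }
}
{code}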



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10856) SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of TIMESTAMP

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10856.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 1.6.0

> SQL Server dialect needs to map java.sql.Timestamp to DATETIME instead of 
> TIMESTAMP
> ---
>
> Key: SPARK-10856
> URL: https://issues.apache.org/jira/browse/SPARK-10856
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Henrik Behrens
>Assignee: Liang-Chi Hsieh
>  Labels: patch
> Fix For: 1.6.0
>
>
> When saving a DataFrame to MS SQL Server, an error is thrown if there is more 
> than one TIMESTAMP column:
> df.printSchema
> root
>  |-- Id: string (nullable = false)
>  |-- TypeInformation_CreatedBy: string (nullable = false)
>  |-- TypeInformation_ModifiedBy: string (nullable = true)
>  |-- TypeInformation_TypeStatus: integer (nullable = false)
>  |-- TypeInformation_CreatedAtDatabase: timestamp (nullable = false)
>  |-- TypeInformation_ModifiedAtDatabase: timestamp (nullable = true)
> df.write.mode("overwrite").jdbc(url, tablename, props)
> com.microsoft.sqlserver.jdbc.SQLServerException: A table can only have one 
> timestamp column. Because table 'DebtorTypeSet1' already has one, the column 
> 'TypeInformation_ModifiedAtDatabase' cannot be added.
> at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:217)
> at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1635)
> at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:426)
> at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:372)
> at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:6276)
> at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1793)
> at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:184)
> at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:159)
> at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeUpdate(SQLServerPreparedStatement.java:315)
> I tested this on Windows and SQL Server 12 using Spark 1.4.1.
> I think this can be fixed in a similar way to Spark-10419.
> As a refererence, here is the type mapping according to the SQL Server JDBC 
> driver (basicDT.java, extracted from sqljdbc_4.2.6420.100_enu.exe):
> private static void displayRow(String title, ResultSet rs) {
>   try {
>     System.out.println(title);
>     System.out.println(rs.getInt(1) + " , " +                // SQL integer type.
>       rs.getString(2) + " , " +                              // SQL char type.
>       rs.getString(3) + " , " +                              // SQL varchar type.
>       rs.getBoolean(4) + " , " +                             // SQL bit type.
>       rs.getDouble(5) + " , " +                              // SQL decimal type.
>       rs.getDouble(6) + " , " +                              // SQL money type.
>       rs.getTimestamp(7) + " , " +                           // SQL datetime type.
>       rs.getDate(8) + " , " +                                // SQL date type.
>       rs.getTime(9) + " , " +                                // SQL time type.
>       rs.getTimestamp(10) + " , " +                          // SQL datetime2 type.
>       ((SQLServerResultSet)rs).getDateTimeOffset(11));       // SQL datetimeoffset type. 
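For what it's worth, a user-side workaround along the same lines as the proposed fix
might be to register a custom dialect that maps TimestampType to DATETIME. The sketch
below uses the public JdbcDialect developer API as it exists in Spark 1.4+; the object
name is made up:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// SQL Server's TIMESTAMP is a rowversion column, so write Spark timestamps as DATETIME.
object SqlServerDatetimeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case TimestampType => Some(JdbcType("DATETIME", Types.TIMESTAMP))
    case _ => None
  }
}

JdbcDialects.registerDialect(SqlServerDatetimeDialect)
{code}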



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7869:
---
Description: 
Most of our tables load into dataframes just fine with postgres. However we 
have a number of tables leveraging the JSONB datatype. Spark will error and 
refuse to load this table. While asking for Spark to support JSONB might be a 
tall order in the short term, it would be great if Spark would at least load 
the table ignoring the columns it can't load or have it be an option.
{code}
pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")

Py4JJavaError: An error occurred while calling o41.load.
: java.sql.SQLException: Unsupported type 
at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
at 
org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
{code}
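A possible query-side workaround, independent of the requested behavior change, is to
cast jsonb to text inside a subquery so the JDBC source only sees a string column. The
sketch below uses the Spark 1.4+ reader API; the column names and table alias are
illustrative:

{code}
// Assumes `sqlContext` and `url` are already defined, as in the snippet above.
val pdf = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "(SELECT id, payload::text AS payload FROM table_of_json) AS t")
  .load()
{code}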

  was:
Most of our tables load into dataframes just fine with postgres. However we 
have a number of tables leveraging the JSONB datatype. Spark will error and 
refuse to load this table. While asking for Spark to support JSONB might be a 
tall order in the short term, it would be great if Spark would at least load 
the table ignoring the columns it can't load or have it be an option.

pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")

Py4JJavaError: An error occurred while calling o41.load.
: java.sql.SQLException: Unsupported type 
at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
at 
org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)


> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --
>
> Key: SPARK-7869
> URL: https://issues.apache.org/jira/browse/SPARK-7869
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.3.1
> Environment: Spark 1.3.1
>Reporter: Brad Willard
>Priority: Minor
>
> Most of our tables load into dataframes just fine with postgres. However we 
> have a number of tables leveraging the JSONB datatype. Spark will error and 
> refuse to load this table. While asking for Spark to support JSONB might be a 
> tall order in the short term, it would be great if Spark would at least load 
> the table ignoring the columns it can't load or have it be an option.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type 
> at 

[jira] [Updated] (SPARK-10186) Add support for more postgres column types

2015-10-07 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10186:

Labels:   (was: array json postgres sql struct)

> Add support for more postgres column types
> --
>
> Key: SPARK-10186
> URL: https://issues.apache.org/jira/browse/SPARK-10186
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>
> The specific observations below are based on Postgres 9.4 tables accessed via 
> the postgresql-9.4-1201.jdbc41.jar driver. However, based on the behavior, I 
> would expect the problem to exist for all external SQL databases.
> - *json and jsonb columns generate {{java.sql.SQLException: Unsupported type 
> }}*. While it is reasonable to not support dynamic schema discovery of 
> JSON columns automatically (it requires two passes over the data), a better 
> behavior would be to create a String column and return the JSON.
> - *Array columns generate {{java.sql.SQLException: Unsupported type 2003}}*. 
> This is true even for simple types, e.g., {{text[]}}. A better behavior would 
> be to create an Array column. 
> - *Custom type columns are mapped to a String column.* This behavior is 
> harder to understand as the schema of a custom type is fixed and therefore 
> mappable to a Struct column. The automatic conversion to a string is also 
> inconsistent when compared to json and array column handling.
> The exceptions are thrown by 
> {{org.apache.spark.sql.jdbc.JDBCRDD$.org$apache$spark$sql$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:100)}}
>  so this definitely looks like a Spark SQL and not a JDBC problem.
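One possible direction, sketched with the public JdbcDialect API (Spark 1.4+): map the
otherwise-unsupported JDBC OTHER/ARRAY types to StringType so the table at least loads.
This is illustrative only, not behavior Spark currently ships:

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object LenientPostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  // Fall back to strings for json/jsonb, arrays, and custom types instead of failing.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.OTHER | Types.ARRAY => Some(StringType)
      case _ => None
    }
}

JdbcDialects.registerDialect(LenientPostgresDialect)
{code}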



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10959) PySpark StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10959:
--
Target Version/s: 1.5.2, 1.6.0

> PySpark StreamingLogisticRegressionWithSGD does not train with given regParam 
> and convergenceTol parameters
> ---
>
> Key: SPARK-10959
> URL: https://issues.apache.org/jira/browse/SPARK-10959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Critical
>
> These parameters are passed into the StreamingLogisticRegressionWithSGD 
> constructor, but do not get transferred to the model used when training.  
> The same problem exists for StreamingLinearRegressionWithSGD; in addition, the 
> intercept param is passed in the wrong argument position, where it is used as 
> the regularization value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10959) PySpark StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

2015-10-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10959:
--
Shepherd: Xiangrui Meng

> PySpark StreamingLogisticRegressionWithSGD does not train with given regParam 
> and convergenceTol parameters
> ---
>
> Key: SPARK-10959
> URL: https://issues.apache.org/jira/browse/SPARK-10959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Critical
>
> These parameters are passed into the StreamingLogisticRegressionWithSGD 
> constructor, but do not get transferred to the model used when training.  
> The same problem exists for StreamingLinearRegressionWithSGD; in addition, the 
> intercept param is passed in the wrong argument position, where it is used as 
> the regularization value.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-10-07 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947775#comment-14947775
 ] 

Huaxin Gao commented on SPARK-8386:
---

Actually, I can also recreate the problem in the other two cases.  The reason I 
didn't recreate it earlier is that I had already fixed the tableExists method in 
my code.  tableExists checks whether the table exists using SELECT 1 FROM $table 
LIMIT 1, which does not work for all databases.  For a database that doesn't 
support LIMIT 1, it returns false, so jdbc/insertIntoJDBC tries to create the 
table again and gets a "table already exists" error. I think this is what Visha 
got. 

I searched JIRA, and there is already an issue open for the LIMIT 1 in 
tableExists, so I will not fix that problem here.  I will only fix the saveMode 
problem described in my first comment. 
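For reference, a dialect-neutral existence probe that avoids LIMIT entirely could look
like the sketch below (illustrative only, not the Spark code in question):

{code}
import java.sql.{Connection, SQLException}

// Selecting zero rows checks that the table exists without relying on LIMIT syntax.
def tableExists(conn: Connection, table: String): Boolean = {
  try {
    val stmt = conn.prepareStatement(s"SELECT 1 FROM $table WHERE 1=0")
    try { stmt.executeQuery(); true } finally { stmt.close() }
  } catch {
    case _: SQLException => false
  }
}
{code}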


 

> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends to a JDBC table new results found at each run. 
>  In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite to true.  I also tried this 
> now:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> getting the same error. It works running the first time creating the new 
> table and adding data successfully. But, running it a second time it (the 
> jdbc driver) will tell me that the table already exists. Even 
> SaveMode.Overwrite will give me the same error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic

2015-10-07 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-10988:
--

 Summary: Reduce duplication in Aggregate2's expression rewriting 
logic
 Key: SPARK-10988
 URL: https://issues.apache.org/jira/browse/SPARK-10988
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Josh Rosen
Assignee: Josh Rosen


In `aggregate/utils.scala`, there is a substantial amount of duplication in the 
expression-rewriting logic. As a prerequisite to supporting imperative 
aggregate functions in `TungstenAggregate`, we should refactor this file so 
that the same expression-rewriting logic is used for both `SortAggregate` and 
`TungstenAggregate`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS updated SPARK-10967:
---
Attachment: CreateDF_sparkshell_jira.scala

Run in spark-shell.

> Incorrect Join behavior in filter conditions
> 
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1
> Environment: RHEL
>Reporter: RaviShankar KS
>Assignee: Josh Rosen
>  Labels: sql, union
> Fix For: 1.5.0
>
> Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --   d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col1: string (nullable = true)
>  |||-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col1: string (nullable = false)
>  ||-- col2: string (nullable = false)
> --d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col2: string (nullable = true)
>  |||-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col2: string (nullable = false)
>  ||-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.
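As a concrete illustration of the work-around described above, the leaf-by-leaf
comparison for the struct column value1 would look like the sketch below (d5 and
d5_opp as created by the attached script; the array column `value` would additionally
need its elements normalized before such a comparison is possible):

{code}
import org.apache.spark.sql.functions.col

// Compare each leaf field explicitly instead of comparing the whole struct.
val joined = d5.as("d5").join(
  d5_opp.as("d5_opp"),
  col("d5.value1.col1") === col("d5_opp.value1.col1") &&
    col("d5.value1.col2") === col("d5_opp.value1.col2"),
  "inner")
{code}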



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10968) Incorrect Join behavior in filter conditions

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS updated SPARK-10968:
---
Attachment: CreateDF_sparkshell_jira.scala

> Incorrect Join behavior in filter conditions
> 
>
> Key: SPARK-10968
> URL: https://issues.apache.org/jira/browse/SPARK-10968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1, 1.5.1
> Environment: RHEL, spark-shell
>Reporter: RaviShankar KS
>  Labels: DataFramejoin, sql,
> Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --   d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col1: string (nullable = true)
>  |||-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col1: string (nullable = false)
>  ||-- col2: string (nullable = false)
> --d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col2: string (nullable = true)
>  |||-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col2: string (nullable = false)
>  ||-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10965) Optimize filesEqualRecursive

2015-10-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946469#comment-14946469
 ] 

Sean Owen commented on SPARK-10965:
---

You don't need it to be assigned to you, just go ahead. I will add you as a 
"Contributor" here which should grant that permission anyway.
Would you store the checksum? It still entails reading the whole file to 
compute it. The way this method is written, it wouldn't help, as it only 
looks at each file once.
Files.equal at least compares by blocks of bytes rather than byte by byte.

> Optimize filesEqualRecursive
> 
>
> Key: SPARK-10965
> URL: https://issues.apache.org/jira/browse/SPARK-10965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Mark Grover
>Priority: Minor
>
> When we try to download dependencies, if there is a file at the destination 
> already, we check whether the files are equal (recursively, if they are 
> directories). For files, we compare their bytes. These dependencies can 
> be jars and be really large, and byte-by-byte comparisons can be super slow.
> I think it'd be better to do a checksum.
> Here's the code in question:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L500
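A minimal sketch of the checksum idea (illustrative, not the Utils.scala code); as
noted in the comment above, it still reads the whole file unless the digests are
cached somewhere:

{code}
import java.io.{File, FileInputStream}
import java.security.MessageDigest

// Stream the file through a digest instead of holding whole byte arrays for comparison.
def fileChecksum(file: File): Array[Byte] = {
  val digest = MessageDigest.getInstance("SHA-256")
  val in = new FileInputStream(file)
  try {
    val buffer = new Array[Byte](8192)
    var read = in.read(buffer)
    while (read != -1) {
      digest.update(buffer, 0, read)
      read = in.read(buffer)
    }
  } finally {
    in.close()
  }
  digest.digest()
}

def filesEqualByChecksum(a: File, b: File): Boolean =
  a.length == b.length && java.util.Arrays.equals(fileChecksum(a), fileChecksum(b))
{code}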



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS updated SPARK-10967:
---
Description: 
We notice that the join conditions are not working as expected in the case of 
nested columns being compared.
As long as the leaf columns under a nested column have the same names, should 
their order matter?

Consider below example for two data frames d5 and d5_opp : 
d5 and d5_opp have a nested field 'value', but their inner leaf columns do not 
have the same ordering. 

--   d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- col1: string (nullable = true)
 |||-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 ||-- col1: string (nullable = false)
 ||-- col2: string (nullable = false)

--d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- col2: string (nullable = true)
 |||-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 ||-- col2: string (nullable = false)
 ||-- col1: string (nullable = false)

The join statements below do not work in Spark 1.5 and raise an exception. In 
Spark 1.4, no exception is raised, but the join result is incorrect:

--d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === $"d5_opp.value", 
 "inner").show
Exception raised is :  
org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to 
data type mismatch: differing types in '(value = value)' 
(array<struct<col1:string,col2:string>> and 
array<struct<col2:string,col1:string>>).;

--d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
$"d5_opp.value1",  "inner").show
Exception raised is :
org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due 
to data type mismatch: differing types in '(value1 = value1)' 
(struct<col1:string,col2:string> and struct<col2:string,col1:string>).;

// Code to be used in spark shell to create the data frames is attached.
-
The only work-around is to explode the conditions for every leaf field. 
In our case, we are generating the conditions and dataframes programmatically, 
and exploding the conditions for every leaf field is additional overhead, and 
may not be always possible.

  was:
We notice that the join conditions are not working as expected in the case of 
nested columns being compared.
As long as leaf columns have the same name under a nested column, should order 
matter ??

Consider below example for two data frames d5 and d5_opp : 
d5 and d5_opp have a nested field 'value', but their inner leaf columns do not 
have the same ordering. 

--   d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- col1: string (nullable = true)
 |||-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 ||-- col1: string (nullable = false)
 ||-- col2: string (nullable = false)

--d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- col2: string (nullable = true)
 |||-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 ||-- col2: string (nullable = false)
 ||-- col1: string (nullable = false)

The below join statement do not work in spark 1.5, and raises exception. In 
spark 1.4, no exception is raised, but join result is incorrect :

--d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === $"d5_opp.value", 
 "inner").show
Exception raised is :  
org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to 
data type mismatch: differing types in '(value = value)' 
(array<struct<col1:string,col2:string>> and 
array<struct<col2:string,col1:string>>).;

--d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
$"d5_opp.value1",  "inner").show
Exception raised is :
org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due 
to data type mismatch: differing types in '(value1 = value1)' 
(struct<col1:string,col2:string> and struct<col2:string,col1:string>).;

-
The only work-around is to explode the conditions for every leaf field. 
In our case, we are generating the conditions and dataframes programmatically, 
and exploding the conditions for every leaf field is additional overhead, and 
may not be always possible.


> Incorrect Join behavior in filter conditions
> 
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Closed] (SPARK-10967) ignore - Incorrect UNION ALL behavior

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS closed SPARK-10967.
--
  Resolution: Invalid
Target Version/s: 1.5.1, 1.4.1  (was: 1.4.1, 1.5.1)

> ignore - Incorrect UNION ALL behavior
> -
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1
> Environment: RHEL
>Reporter: RaviShankar KS
>Assignee: Josh Rosen
>  Labels: sql, union
> Fix For: 1.5.0
>
>
> IGNORE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10967) Incorrect UNION ALL behavior

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS updated SPARK-10967:
---
Attachment: (was: CreateDF_sparkshell_jira.scala)

> Incorrect UNION ALL behavior
> 
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1
> Environment: RHEL
>Reporter: RaviShankar KS
>Assignee: Josh Rosen
>  Labels: sql, union
> Fix For: 1.5.0
>
>
> IGNORE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10967) Incorrect UNION ALL behavior

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS updated SPARK-10967:
---
Description: IGNORE  (was: We notice that the join conditions are not 
working as expected in the case of nested columns being compared.
As long as leaf columns have the same name under a nested column, should order 
matter ??

Consider below example for two data frames d5 and d5_opp : 
d5 and d5_opp have a nested field 'value', but their inner leaf columns do not 
have the same ordering. 

--   d5.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- col1: string (nullable = true)
 |||-- col2: string (nullable = true)
 |-- value1: struct (nullable = false)
 ||-- col1: string (nullable = false)
 ||-- col2: string (nullable = false)

--d5_opp.printSchema
root
 |-- key: integer (nullable = false)
 |-- value: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- col2: string (nullable = true)
 |||-- col1: string (nullable = true)
 |-- value1: struct (nullable = false)
 ||-- col2: string (nullable = false)
 ||-- col1: string (nullable = false)

The below join statement do not work in spark 1.5, and raises exception. In 
spark 1.4, no exception is raised, but join result is incorrect :

--d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === $"d5_opp.value", 
 "inner").show
Exception raised is :  
org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to 
data type mismatch: differing types in '(value = value)' 
(array<struct<col1:string,col2:string>> and 
array<struct<col2:string,col1:string>>).;

--d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
$"d5_opp.value1",  "inner").show
Exception raised is :
org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due 
to data type mismatch: differing types in '(value1 = value1)' 
(struct<col1:string,col2:string> and struct<col2:string,col1:string>).;

// Code to be used in spark shell to create the data frames is attached.
-
The only work-around is to explode the conditions for every leaf field. 
In our case, we are generating the conditions and dataframes programmatically, 
and exploding the conditions for every leaf field is additional overhead, and 
may not be always possible.)

> Incorrect UNION ALL behavior
> 
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1
> Environment: RHEL
>Reporter: RaviShankar KS
>Assignee: Josh Rosen
>  Labels: sql, union
> Fix For: 1.5.0
>
>
> IGNORE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10967) ignore - Incorrect UNION ALL behavior

2015-10-07 Thread RaviShankar KS (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

RaviShankar KS updated SPARK-10967:
---
Summary: ignore - Incorrect UNION ALL behavior  (was: Incorrect UNION ALL 
behavior)

> ignore - Incorrect UNION ALL behavior
> -
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1
> Environment: RHEL
>Reporter: RaviShankar KS
>Assignee: Josh Rosen
>  Labels: sql, union
> Fix For: 1.5.0
>
>
> IGNORE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10968) Incorrect Join behavior in filter conditions

2015-10-07 Thread RaviShankar KS (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946493#comment-14946493
 ] 

RaviShankar KS commented on SPARK-10968:


not actually incorrect.
DataFrame d5.value has fields col1 and col2, in that order
DataFrame d5_opp.value has fields col2 and col1, in that order

If I add the condition "d5.value === d5_opp.value" to the join, then it should 
behave the same as the explicit condition:
d5.value.col1 === d5_opp.value.col1 && d5.value.col2 === d5_opp.value.col2

The leaf fields have the same data types in both DataFrames.
I believe order should not matter here.


> Incorrect Join behavior in filter conditions
> 
>
> Key: SPARK-10968
> URL: https://issues.apache.org/jira/browse/SPARK-10968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1, 1.5.1
> Environment: RHEL, spark-shell
>Reporter: RaviShankar KS
>  Labels: DataFramejoin, sql,
> Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --   d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col1: string (nullable = true)
>  |||-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col1: string (nullable = false)
>  ||-- col2: string (nullable = false)
> --d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col2: string (nullable = true)
>  |||-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col2: string (nullable = false)
>  ||-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10968) Incorrect Join behavior in filter conditions

2015-10-07 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14946358#comment-14946358
 ] 

Liang-Chi Hsieh commented on SPARK-10968:
-

Is it incorrect? Because d5.value and d5_opp.value are actually different 
types, as the exception suggests.

> Incorrect Join behavior in filter conditions
> 
>
> Key: SPARK-10968
> URL: https://issues.apache.org/jira/browse/SPARK-10968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.4.1, 1.5.1
> Environment: RHEL, spark-shell
>Reporter: RaviShankar KS
>  Labels: DataFramejoin, sql,
> Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --   d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col1: string (nullable = true)
>  |||-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col1: string (nullable = false)
>  ||-- col2: string (nullable = false)
> --d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- col2: string (nullable = true)
>  |||-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  ||-- col2: string (nullable = false)
>  ||-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10988:


Assignee: Apache Spark  (was: Josh Rosen)

> Reduce duplication in Aggregate2's expression rewriting logic
> -
>
> Key: SPARK-10988
> URL: https://issues.apache.org/jira/browse/SPARK-10988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> In `aggregate/utils.scala`, there is a substantial amount of duplication in 
> the expression-rewriting logic. As a prerequisite to supporting imperative 
> aggregate functions in `TungstenAggregate`, we should refactor this file so 
> that the same expression-rewriting logic is used for both `SortAggregate` and 
> `TungstenAggregate`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10064) Decision tree continuous feature binning is slow in large feature spaces

2015-10-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10064.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8246
[https://github.com/apache/spark/pull/8246]

> Decision tree continuous feature binning is slow in large feature spaces
> 
>
> Key: SPARK-10064
> URL: https://issues.apache.org/jira/browse/SPARK-10064
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Nathan Howell
>Assignee: Nathan Howell
>Priority: Minor
> Fix For: 1.6.0
>
>
> When working with large feature spaces and high bin counts (>500) the binning 
> process can take many hours. This is particularly painful because it ties up 
> executors for the duration, which is not shared-cluster friendly.
> The binning process can and should be performed on the executors instead of 
> the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10992:


Assignee: Apache Spark

> Partial Aggregation Support for Hive UDAF
> -
>
> Key: SPARK-10992
> URL: https://issues.apache.org/jira/browse/SPARK-10992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947912#comment-14947912
 ] 

Apache Spark commented on SPARK-10992:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7788

> Partial Aggregation Support for Hive UDAF
> -
>
> Key: SPARK-10992
> URL: https://issues.apache.org/jira/browse/SPARK-10992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10992) Partial Aggregation Support for Hive UDAF

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10992:


Assignee: (was: Apache Spark)

> Partial Aggregation Support for Hive UDAF
> -
>
> Key: SPARK-10992
> URL: https://issues.apache.org/jira/browse/SPARK-10992
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10997) Netty-based RPC env should support a "client-only" mode.

2015-10-07 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-10997:
--

 Summary: Netty-based RPC env should support a "client-only" mode.
 Key: SPARK-10997
 URL: https://issues.apache.org/jira/browse/SPARK-10997
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin


The new Netty-based RPC still behaves too much like Akka; it requires both the 
client (e.g. an executor) and the server (e.g. the driver) to listen for incoming 
connections.

That is not necessary, since sockets are full-duplex and RPCs should be able to 
flow either way on any connection. Also, because the semantics of the 
Netty-based RPC don't exactly match Akka's, you get weird issues like SPARK-10987.

Supporting a client-only mode also reduces the number of ports Spark apps need 
to use.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8386) DataFrame and JDBC regression

2015-10-07 Thread Peter Haumer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947815#comment-14947815
 ] 

Peter Haumer commented on SPARK-8386:
-

Huaxin Gao, sorry for not replying earlier. It slipped through the cracks. 

I had stopped using the Spark JDBC framework completely because of this bug and 
implemented my own, since I need to support DB2, Derby, and SQL Server as well as 
Oracle, and I ran into this issue with SQL Server and DB2. DB2's LIMIT syntax is 
quite different: 
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0059212.html?lang=en.

However, I ended up using the JDBC metadata, which in fact does not perform too 
well on all DBMSs; it might be better to provide a different query for each DBMS 
here:

// Check for table existence via the driver's metadata rather than a probe query.
final ResultSet resultSet = connection.getMetaData().getTables(null, schemaName, tableName, null);
if (resultSet.next()) {
    return true;
}

> DataFrame and JDBC regression
> -
>
> Key: SPARK-8386
> URL: https://issues.apache.org/jira/browse/SPARK-8386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: RHEL 7.1
>Reporter: Peter Haumer
>Priority: Critical
>
> I have an ETL app that appends to a JDBC table new results found at each run. 
>  In 1.3.1 I did this:
> testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false);
> When I do this now in 1.4 it complains that the "object" 'TABLE_NAME' already 
> exists. I get this even if I switch the overwrite to true.  I also tried this 
> now:
> testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, 
> connectionProperties);
> getting the same error. It works running the first time creating the new 
> table and adding data successfully. But, running it a second time it (the 
> jdbc driver) will tell me that the table already exists. Even 
> SaveMode.Overwrite will give me the same error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic

2015-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947817#comment-14947817
 ] 

Apache Spark commented on SPARK-10988:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9015

> Reduce duplication in Aggregate2's expression rewriting logic
> -
>
> Key: SPARK-10988
> URL: https://issues.apache.org/jira/browse/SPARK-10988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In `aggregate/utils.scala`, there is a substantial amount of duplication in 
> the expression-rewriting logic. As a prerequisite to supporting imperative 
> aggregate functions in `TungstenAggregate`, we should refactor this file so 
> that the same expression-rewriting logic is used for both `SortAggregate` and 
> `TungstenAggregate`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10988:


Assignee: Josh Rosen  (was: Apache Spark)

> Reduce duplication in Aggregate2's expression rewriting logic
> -
>
> Key: SPARK-10988
> URL: https://issues.apache.org/jira/browse/SPARK-10988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In `aggregate/utils.scala`, there is a substantial amount of duplication in 
> the expression-rewriting logic. As a prerequisite to supporting imperative 
> aggregate functions in `TungstenAggregate`, we should refactor this file so 
> that the same expression-rewriting logic is used for both `SortAggregate` and 
> `TungstenAggregate`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10940) Too many open files Spark Shuffle

2015-10-07 Thread Sandeep Pal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947836#comment-14947836
 ] 

Sandeep Pal commented on SPARK-10940:
-

I rebooted all the machines and the issue is no longer reproducible. Closing the issue.

> Too many open files Spark Shuffle
> -
>
> Key: SPARK-10940
> URL: https://issues.apache.org/jira/browse/SPARK-10940
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 1.5.0
> Environment: 6 node standalone spark cluster with 1 master and 5 
> worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and 
> 36 cores.
>Reporter: Sandeep Pal
>
> Executing terasort via Spark SQL on the data generated by teragen in Hadoop. 
> The data size generated is ~456 GB. 
> Terasort passes with --total-executor-cores = 40, whereas it fails with 
> --total-executor-cores = 120. 
> I have tried increasing the ulimit to 10k, but the problem persists.
> Note: The failing 120-core configuration worked with Spark Core code 
> on top of RDDs. The failure occurs only when using Spark SQL.
> Below is the error message from one of the executor node:
> java.io.FileNotFoundException: 
> /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90
>  (Too many open files)
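Not part of the original report, but for context, a hedged Scala sketch of the knobs that typically affect shuffle file counts on Spark 1.5 (values are illustrative, and the OS-level ulimit still has to be raised on the worker nodes themselves):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative settings only; tune per cluster. Fewer reduce partitions means
// fewer shuffle output files per map task; consolidateFiles only applies to
// the hash-based shuffle manager.
val conf = new SparkConf()
  .setAppName("terasort-sql")
  .set("spark.sql.shuffle.partitions", "400")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
{code}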






[jira] [Commented] (SPARK-10990) Avoid the serialization multiple times during unrolling of complex types

2015-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947838#comment-14947838
 ] 

Apache Spark commented on SPARK-10990:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9016

> Avoid the serialization multiple times during unrolling of complex types
> 
>
> Key: SPARK-10990
> URL: https://issues.apache.org/jira/browse/SPARK-10990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> serialize() is called by both actualSize() and append(), so a complex-typed 
> row can be serialized multiple times while a block is being unrolled; we 
> should apply an UnsafeProjection once before unrolling instead.
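As a rough illustration of the idea (hypothetical types, not Spark's MemoryStore or UnsafeProjection code): serialize each row exactly once up front, and let both the size estimate and the append step reuse the already-serialized bytes.

{code}
// Hypothetical "serialize once, reuse" sketch.
final case class Row(values: Seq[Any])

object SerializeOnce {
  // Stand-in for an expensive conversion such as an UnsafeProjection.
  private def serialize(row: Row): Array[Byte] =
    row.values.mkString(",").getBytes("UTF-8")

  def unroll(rows: Iterator[Row]): (Long, Seq[Array[Byte]]) = {
    var totalSize = 0L
    val buffer = Seq.newBuilder[Array[Byte]]
    rows.foreach { row =>
      val bytes = serialize(row) // serialized exactly once per row
      totalSize += bytes.length  // size estimate reuses the bytes
      buffer += bytes            // append reuses the same bytes
    }
    (totalSize, buffer.result())
  }
}
{code}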






[jira] [Assigned] (SPARK-10990) Avoid the serialization multiple times during unrolling of complex types

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10990:


Assignee: Davies Liu  (was: Apache Spark)

> Avoid the serialization multiple times during unrolling of complex types
> 
>
> Key: SPARK-10990
> URL: https://issues.apache.org/jira/browse/SPARK-10990
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> serialize() is called by both actualSize() and append(), so a complex-typed 
> row can be serialized multiple times while a block is being unrolled; we 
> should apply an UnsafeProjection once before unrolling instead.






[jira] [Created] (SPARK-10990) Avoid the serialization multiple times during unrolling of complex types

2015-10-07 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10990:
--

 Summary: Avoid the serialization multiple times during unrolling 
of complex types
 Key: SPARK-10990
 URL: https://issues.apache.org/jira/browse/SPARK-10990
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


serialize() is called by both actualSize() and append(), so a complex-typed 
row can be serialized multiple times while a block is being unrolled; we 
should apply an UnsafeProjection once before unrolling instead.






[jira] [Commented] (SPARK-10987) yarn-client mode misbehaving with netty-based RPC backend

2015-10-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947893#comment-14947893
 ] 

Marcelo Vanzin commented on SPARK-10987:


Anyway, here's what I found so far.

The driver launches the AM; the AM connects back to the driver and sends 
messages, but the driver never sends any messages to the AM. That means that in 
{{NettyRpcHandler::connectionTerminated}} the {{Disassociated}} message is not 
sent: because no message was ever sent *to* the AM, the code in 
{{NettyRpcHandler::receive}} never ran, so the driver connection was never 
recorded.

So there must be a way for {{NettyRpcHandler}} to know when outgoing 
connections are killed, not just incoming ones.

In a way this is caused by the code trying to mimic what Akka does, but falling 
short: since the AM is purely a client, it shouldn't need to listen for 
connections or rely on incoming connections for anything; it should be able to 
register itself and do everything over the client socket it opened. That's 
probably going to be tricky to fix, though.
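To make the gap concrete, here is a self-contained, hypothetical Scala sketch (not the real NettyRpcHandler): if peers are only recorded inside receive(), a peer that never sends us a message (here, the driver as seen from the AM) is never registered, so its death produces no disassociation event unless connections we opened ourselves are tracked too.

{code}
import java.net.InetSocketAddress
import scala.collection.mutable

// Hypothetical sketch, not Spark code: why purely-outbound peers are invisible
// to connection-termination handling.
class TrackingRpcHandler {
  private val knownPeers = mutable.Set.empty[InetSocketAddress]

  // Today: a peer is only recorded when a message arrives *from* it.
  def receive(sender: InetSocketAddress, message: Any): Unit = {
    knownPeers += sender
    // ... dispatch the message ...
  }

  // Possible addition: also record the peer when *we* open a connection to it.
  def outboundConnected(peer: InetSocketAddress): Unit = knownPeers += peer

  // Termination only notifies for recorded peers; without the hook above, a
  // peer that never sent us anything (e.g. the driver, from the AM's side) is
  // silently dropped and no Disassociated-style event is delivered.
  def connectionTerminated(peer: InetSocketAddress): Option[InetSocketAddress] =
    if (knownPeers.remove(peer)) Some(peer) else None
}
{code}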

> yarn-client mode misbehaving with netty-based RPC backend
> -
>
> Key: SPARK-10987
> URL: https://issues.apache.org/jira/browse/SPARK-10987
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> YARN running in cluster deploy mode seems to be having issues with the new 
> RPC backend; if you look at unit test runs, tests that run in cluster mode 
> are taking several minutes to run, instead of the more usual 20-30 seconds.
> For example, 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull:
> {noformat}
> [info] YarnClusterSuite:
> [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds)
> [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds)
> [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds)
> [info] - run Python application in yarn-client mode (21 seconds, 842 
> milliseconds)
> [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds)
> [info] - user class path first in client mode (1 minute, 58 seconds)
> [info] - user class path first in cluster mode (4 minutes, 49 seconds)
> {noformat}






[jira] [Assigned] (SPARK-10989) Add the dot and hadamard products to the Vectors object

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10989:


Assignee: Apache Spark

> Add the dot and hadamard products to the Vectors object
> ---
>
> Key: SPARK-10989
> URL: https://issues.apache.org/jira/browse/SPARK-10989
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Stephen Tridgell
>Assignee: Apache Spark
>Priority: Minor
>  Labels: dot-product, hadamard-product, mllib
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Creating an issue to add some convenience functionality to the Vectors 
> object: I have implemented the dot and Hadamard (element-wise) products. 
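The ticket does not include the implementation; as a sketch of the intended semantics only (helper names here are illustrative, not the proposed API):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// dot: sum of element-wise products; Hadamard: element-wise product.
// Dense-only sketch; a real implementation would special-case sparse vectors.
def dot(x: Vector, y: Vector): Double = {
  require(x.size == y.size, "vectors must have the same length")
  x.toArray.zip(y.toArray).map { case (a, b) => a * b }.sum
}

def hadamard(x: Vector, y: Vector): Vector = {
  require(x.size == y.size, "vectors must have the same length")
  Vectors.dense(x.toArray.zip(y.toArray).map { case (a, b) => a * b })
}

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
dot(v1, v2)      // 32.0
hadamard(v1, v2) // [4.0, 10.0, 18.0]
{code}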






[jira] [Created] (SPARK-10994) Local clustering coefficient computation in GraphX

2015-10-07 Thread Yang Yang (JIRA)
Yang Yang created SPARK-10994:
-

 Summary: Local clustering coefficient computation in GraphX
 Key: SPARK-10994
 URL: https://issues.apache.org/jira/browse/SPARK-10994
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Yang Yang


We propose to implement an algorithm to compute the local clustering 
coefficient in GraphX. The local clustering coefficient of a vertex (node) in a 
graph quantifies how close its neighbors are to being a clique (complete 
graph). More specifically, the local clustering coefficient C_i of a vertex 
v_i is the number of links between the vertices within its neighbourhood 
divided by the number of links that could possibly exist between them. Duncan 
J. Watts and Steven Strogatz introduced the measure in 1998 to determine 
whether a graph is a small-world network. 
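For reference, for an undirected vertex v_i with degree k_i and T_i triangles through it, C_i = 2 * T_i / (k_i * (k_i - 1)). Since GraphX already exposes triangle counts and degrees, a hedged sketch of the computation (edge-list path illustrative; triangleCount() expects canonical, partitioned edges) could be:

{code}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy, VertexRDD}

// Assumes an existing SparkContext `sc`; the edge-list path is illustrative.
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///path/to/edges.txt", canonicalOrientation = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)

val triangles: VertexRDD[Int] = graph.triangleCount().vertices
val degrees: VertexRDD[Int] = graph.degrees

// C_i = 2 * T_i / (k_i * (k_i - 1)); vertices with degree < 2 get 0.
val localClusteringCoefficient = triangles.innerJoin(degrees) { (_, tri, deg) =>
  if (deg < 2) 0.0 else 2.0 * tri / (deg * (deg - 1))
}
{code}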








[jira] [Assigned] (SPARK-10989) Add the dot and hadamard products to the Vectors object

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10989:


Assignee: (was: Apache Spark)

> Add the dot and hadamard products to the Vectors object
> ---
>
> Key: SPARK-10989
> URL: https://issues.apache.org/jira/browse/SPARK-10989
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Stephen Tridgell
>Priority: Minor
>  Labels: dot-product, hadamard-product, mllib
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Creating an issue to add some convenience functionality to the Vectors 
> object: I have implemented the dot and Hadamard (element-wise) products. 






[jira] [Commented] (SPARK-10998) Show non-children in default Expression.toString

2015-10-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948007#comment-14948007
 ] 

Apache Spark commented on SPARK-10998:
--

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9022

> Show non-children in default Expression.toString
> 
>
> Key: SPARK-10998
> URL: https://issues.apache.org/jira/browse/SPARK-10998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Assigned] (SPARK-10998) Show non-children in default Expression.toString

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10998:


Assignee: Michael Armbrust  (was: Apache Spark)

> Show non-children in default Expression.toString
> 
>
> Key: SPARK-10998
> URL: https://issues.apache.org/jira/browse/SPARK-10998
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Assigned] (SPARK-10767) Make pyspark shared params codegen more consistent

2015-10-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10767:


Assignee: (was: Apache Spark)

> Make pyspark shared params codegen more consistent 
> ---
>
> Key: SPARK-10767
> URL: https://issues.apache.org/jira/browse/SPARK-10767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> Namely, "." shows up in some places in the template when the param docstring 
> is used, and not in others.





