[jira] [Created] (SPARK-10846) Stray META-INF in directory spark-shell is launched from causes problems

2015-09-27 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-10846:
-

 Summary: Stray META-INF in directory spark-shell is launched from 
causes problems
 Key: SPARK-10846
 URL: https://issues.apache.org/jira/browse/SPARK-10846
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.5.0
Reporter: Ryan Williams
Priority: Minor


I observed some perplexing errors while running {{$SPARK_HOME/bin/spark-shell}} 
yesterday (with {{$SPARK_HOME}} pointing at a clean 1.5.0 install):

{code}
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider 
org.apache.hadoop.fs.s3.S3FileSystem not found
{code}

while initializing {{HiveContext}}; full example output is 
[here|https://gist.github.com/ryan-williams/34210ad640687113e5c3#file-1-5-0-failure].

The issue was that a stray {{META-INF}} directory from some other project I'd 
built months ago was sitting in the directory that I'd run {{spark-shell}} from 
(*not* in my {{$SPARK_HOME}}, just in the directory I happened to be in when I 
ran {{$SPARK_HOME/bin/spark-shell}}). 

That {{META-INF}} had a {{services/org.apache.hadoop.fs.FileSystem}} file 
specifying some provider classes ({{S3FileSystem}} in the example above) that 
were unsurprisingly not resolvable by Spark.

I'm not sure if this is purely my fault for attempting to run Spark from a 
directory with another project's config files lying around, but I find it 
somewhat surprising that, given a {{$SPARK_HOME}} pointing to a clean Spark 
install, {{$SPARK_HOME/bin/spark-shell}} picks up detritus from the {{cwd}} it 
is called from, so I wanted to at least document it here.
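
For reference, the failing lookup above is the standard JDK service-loader 
mechanism that Hadoop's {{FileSystem}} uses; a minimal sketch of it, runnable 
from spark-shell assuming the Hadoop classes are on the classpath:

{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Enumerate the FileSystem providers visible to this JVM. A stray
// META-INF/services/org.apache.hadoop.fs.FileSystem on the classpath adds its
// listed classes here; iterating throws ServiceConfigurationError if one of
// them (e.g. org.apache.hadoop.fs.s3.S3FileSystem) cannot be loaded.
ServiceLoader.load(classOf[FileSystem]).asScala
  .foreach(fs => println(fs.getClass.getName))
{code}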




[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909764#comment-14909764
 ] 

Ondřej Smola commented on SPARK-8734:
-

The problem is a limitation of the Java Properties format; the common workaround 
is to use a comma as a separator, which I have seen many times. I think we can link 
[Properties.html#load(java.io.Reader)|http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)]
 in the documentation.
Are you OK with that?
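
For illustration, a minimal sketch of the comma-separated workaround on top of 
{{Properties.load(java.io.Reader)}} (the Mesos volumes property is used here 
only as an example):

{code}
import java.io.StringReader
import java.util.Properties

// A Java Properties key cannot repeat, so multiple values are packed into a
// single comma-separated value and split by the consumer.
val props = new Properties()
props.load(new StringReader("spark.mesos.executor.docker.volumes=/host/a:/a,/host/b:/b:ro"))
props.getProperty("spark.mesos.executor.docker.volumes").split(",").foreach(println)
{code}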

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.




[jira] [Resolved] (SPARK-10741) Hive Query Having/OrderBy against Parquet table is not working

2015-09-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10741.
--
   Resolution: Fixed
Fix Version/s: 1.6.0
   1.5.2

This issue has been resolved by https://github.com/apache/spark/pull/8889.

> Hive Query Having/OrderBy against Parquet table is not working 
> ---
>
> Key: SPARK-10741
> URL: https://issues.apache.org/jira/browse/SPARK-10741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ian
>Assignee: Wenchen Fan
> Fix For: 1.5.2, 1.6.0
>
>
> Failed Query with Having Clause
> {code}
>   def testParquetHaving() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedHaving =
>   """ SELECT c1, avg ( c2 ) as c_avg
> | FROM test
> | GROUP BY c1
> | HAVING ( avg ( c2 ) > 5)  ORDER BY c1""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedHaving).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#16 missing 
> from c1#17,c2#18 in operator !Aggregate [c1#17], [cast((avg(cast(c2#16 as 
> bigint)) > cast(5 as double)) as boolean) AS 
> havingCondition#12,c1#17,avg(cast(c2#18 as bigint)) AS c_avg#9];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> {code}
> Failed Query with OrderBy
> {code}
>   def testParquetOrderBy() {
> val ddl =
>   """CREATE TABLE IF NOT EXISTS test ( c1 string, c2 int ) STORED AS 
> PARQUET"""
> val failedOrderBy =
>   """ SELECT c1, avg ( c2 ) c_avg
> | FROM test
> | GROUP BY c1
> | ORDER BY avg ( c2 )""".stripMargin
> TestHive.sql(ddl)
> TestHive.sql(failedOrderBy).collect
>   }
> org.apache.spark.sql.AnalysisException: resolved attribute(s) c2#33 missing 
> from c1#34,c2#35 in operator !Aggregate [c1#34], [avg(cast(c2#33 as bigint)) 
> AS aggOrder#31,c1#34,avg(cast(c2#35 as bigint)) AS c_avg#28];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> {code}




[jira] [Comment Edited] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909764#comment-14909764
 ] 

Ondřej Smola edited comment on SPARK-8734 at 9/27/15 2:55 PM:
--

The problem is a limitation of the Java Properties format; the common workaround 
is to use a comma as a separator, which I have seen many times. I think we can link 
[Properties.load(java.io.Reader)|http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)]
 in the documentation.
Are you OK with that?


was (Author: ondrej.smola):
Problem is limitation of java Properties format - common workaround is using 
using comma as separator - i saw it so many times,
i think we can link 
[Properties.html#load(java.io.Reader)|http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)]
 in documentation.
Are you ok with it?

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.




[jira] [Commented] (SPARK-10722) Uncaught exception: RDDBlockId not found in driver-heartbeater

2015-09-27 Thread Michael Malak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909883#comment-14909883
 ] 

Michael Malak commented on SPARK-10722:
---

I have seen this in a small Hello World-type program, compiled and run from sbt, 
that reads a large text file and calls .cache(). But if I instead do sbt package 
and then spark-submit (rather than just sbt run), it works. That suggests there 
may be some dependency that is omitted from the published spark-core artifact 
but is present in spark-assembly.

This link suggests slf4j-simple.jar, but adding that to my .sbt didn't help.
https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/spark-Exception-in-thread-quot-main-quot-java-lang/td-p/19544

Googling, it seems the problem is more commonly encountered while running unit 
tests during the build of Spark itself.

> Uncaught exception: RDDBlockId not found in driver-heartbeater
> --
>
> Key: SPARK-10722
> URL: https://issues.apache.org/jira/browse/SPARK-10722
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Simeon Simeonov
>
> Some operations involving cached RDDs generate an uncaught exception in 
> driver-heartbeater. If the {{.cache()}} call is removed, processing happens 
> without the exception. However, not all RDDs trigger the problem, i.e., some 
> {{.cache()}} operations are fine. 
> I can see the problem with 1.4.1 and 1.5.0 but I have not been able to create 
> a reproducible test case. The same exception is [reported on 
> SO|http://stackoverflow.com/questions/31280355/spark-test-on-local-machine] 
> for v1.3.1 but the behavior is related to large broadcast variables.
> The full stack trace is:
> {code}
> 15/09/20 22:10:08 ERROR Utils: Uncaught exception in thread driver-heartbeater
> java.io.IOException: java.lang.ClassNotFoundException: 
> org.apache.spark.storage.RDDBlockId
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
>   at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:91)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:440)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:430)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:430)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:428)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:428)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
>   at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> 

[jira] [Resolved] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10720.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8879
[https://github.com/apache/spark/pull/8879]

> Add a java wrapper to create dataframe from a local list of Java Beans.
> ---
>
> Key: SPARK-10720
> URL: https://issues.apache.org/jira/browse/SPARK-10720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> Similar to SPARK-10630 it would be nice if Java users didn't have to 
> parallelize there data explicitly (as Scala users already can skip). Issue 
> came up in 
> http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
>  
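
For context, a rough sketch of the kind of wrapper this adds (the overload shape 
is assumed from the pull request above; shown from Scala with a bean-style class):

{code}
import scala.beans.BeanProperty

// A Java-bean-style class: no-arg constructor plus getters/setters.
class Person(@BeanProperty var name: String, @BeanProperty var age: Int) {
  def this() = this(null, 0)
}

val people = java.util.Arrays.asList(new Person("Bob", 34), new Person("Dan", 42))
// Assumed Java-friendly overload: createDataFrame(java.util.List[_], beanClass).
val df = sqlContext.createDataFrame(people, classOf[Person])
df.show()
{code}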




[jira] [Updated] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10720:
--
Assignee: holdenk

> Add a java wrapper to create dataframe from a local list of Java Beans.
> ---
>
> Key: SPARK-10720
> URL: https://issues.apache.org/jira/browse/SPARK-10720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 1.6.0
>
>
> Similar to SPARK-10630 it would be nice if Java users didn't have to 
> parallelize there data explicitly (as Scala users already can skip). Issue 
> came up in 
> http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
>  




[jira] [Created] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2015-09-27 Thread Miklos Christine (JIRA)
Miklos Christine created SPARK-10848:


 Summary: Applied JSON Schema Works for json RDD but not when 
loading json file
 Key: SPARK-10848
 URL: https://issues.apache.org/jira/browse/SPARK-10848
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Miklos Christine
Priority: Minor


Using a defined schema to load a JSON RDD works as expected, but loading the same 
JSON records from a file does not fully apply the supplied schema. In particular, 
the nullable flag is not applied: loading from a file uses nullable=true on all 
fields regardless of the supplied schema. 

Code to reproduce:
{code}
import  org.apache.spark.sql.types._

val jsonRdd = sc.parallelize(List(
  """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
"ProductCode": "WQT648", "Qty": 5}""",
  """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
"expressDelivery":true}"""))

val mySchema = StructType(Array(
  StructField(name="OrderID"   , dataType=LongType, nullable=false),
  StructField("CustomerID", IntegerType, false),
  StructField("OrderDate", DateType, false),
  StructField("ProductCode", StringType, false),
  StructField("Qty", IntegerType, false),
  StructField("Discount", FloatType, true),
  StructField("expressDelivery", BooleanType, true)
))

val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
val schema1 = myDF.printSchema


val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
val schema2 = dfDFfromFile.printSchema
{code}

Orders.json
{code}
{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
"WQT648", "Qty": 5}
{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
"LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
{code}

The behavior should be consistent. 
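
For comparison, the nullability each DataFrame actually ends up with can be 
printed side by side, using the variables from the code above:

{code}
// Per the report above, expect nullable=false where specified on the RDD-based
// DataFrame, but nullable=true everywhere on the file-based one.
myDF.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
dfDFfromFile.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
{code}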




[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available

2015-09-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909873#comment-14909873
 ] 

Sean Owen commented on SPARK-10644:
---

Yes, but how much memory do the workers allocate to executors?
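
For reference, the settings in play here for a standalone cluster, as a sketch 
(the core cap is from the report; the memory value is only illustrative):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cores.max", "10")       // per-application core cap from the report
  .set("spark.executor.memory", "2g") // how much each executor asks of a worker
{code}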

> Applications wait even if free executors are available
> --
>
> Key: SPARK-10644
> URL: https://issues.apache.org/jira/browse/SPARK-10644
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
> Environment: RHEL 6.5 64 bit
>Reporter: Balagopal Nair
>Priority: Minor
>
> Number of workers: 21
> Number of executors: 63
> Steps to reproduce:
> 1. Run 4 jobs each with max cores set to 10
> 2. The first 3 jobs run with 10 each. (30 executors consumed so far)
> 3. The 4 th job waits even though there are 33 idle executors.
> The reason is that a job will not get executors unless 
> the total number of EXECUTORS in use < the number of WORKERS
> If there are executors available, resources should be allocated to the 
> pending job.




[jira] [Commented] (SPARK-10846) Stray META-INF in directory spark-shell is launched from causes problems

2015-09-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909872#comment-14909872
 ] 

Sean Owen commented on SPARK-10846:
---

It's a standard Java mechanism, yes. It would have to be on the classpath to do 
anything, though. I don't think your cwd is on the classpath, or shouldn't be. 
Can you check whether that's somehow true? Then the question is how to get rid of 
it. Barring that, yeah, it's just how the service discovery mechanism works in 
the JVM, and it's outside of Spark.
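
One quick way to check, as a sketch run from the same spark-shell:

{code}
// True if the current working directory itself is a classpath entry.
val cwd = new java.io.File(".").getCanonicalPath
sys.props("java.class.path")
  .split(java.io.File.pathSeparator)
  .exists(entry => new java.io.File(entry).getCanonicalPath == cwd)
{code}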

> Stray META-INF in directory spark-shell is launched from causes problems
> 
>
> Key: SPARK-10846
> URL: https://issues.apache.org/jira/browse/SPARK-10846
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>Priority: Minor
>
> I observed some perplexing errors while running 
> {{$SPARK_HOME/bin/spark-shell}} yesterday (with {{$SPARK_HOME}} pointing at a 
> clean 1.5.0 install):
> {code}
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: 
> Provider org.apache.hadoop.fs.s3.S3FileSystem not found
> {code}
> while initializing {{HiveContext}}; full example output is 
> [here|https://gist.github.com/ryan-williams/34210ad640687113e5c3#file-1-5-0-failure].
> The issue was that a stray {{META-INF}} directory from some other project I'd 
> built months ago was sitting in the directory that I'd run {{spark-shell}} 
> from (*not* in my {{$SPARK_HOME}}, just in the directory I happened to be in 
> when I ran {{$SPARK_HOME/bin/spark-shell}}). 
> That {{META-INF}} had a {{services/org.apache.hadoop.fs.FileSystem}} file 
> specifying some provider classes ({{S3FileSystem}} in the example above) that 
> were unsurprisingly not resolvable by Spark.
> I'm not sure if this is purely my fault for attempting to run Spark from a 
> directory with another project's config files lying around, but I find it 
> somewhat surprising that, given a {{$SPARK_HOME}} pointing to a clean Spark 
> install, {{$SPARK_HOME/bin/spark-shell}} picks up detritus from the 
> {{cwd}} it is called from, so I wanted to at least document it here.




[jira] [Created] (SPARK-10847) Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure

2015-09-27 Thread Shea Parkes (JIRA)
Shea Parkes created SPARK-10847:
---

 Summary: Pyspark - DataFrame - Optional Metadata with `None` 
triggers cryptic failure
 Key: SPARK-10847
 URL: https://issues.apache.org/jira/browse/SPARK-10847
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.0
 Environment: Windows 7
java version "1.8.0_60" (64bit)
Python 3.4.x

Standalone cluster mode (not local[n]; a full local cluster)
Reporter: Shea Parkes
Priority: Minor


If the optional metadata passed to `pyspark.sql.types.StructField` includes a 
Pythonic `None`, then `pyspark.sql.SQLContext.createDataFrame` will fail with a 
very cryptic/unhelpful error.

Here is a minimal reproducible example:
{code:none}
# Assumes sc exists
from pyspark.sql import SQLContext
import pyspark.sql.types as types
sqlContext = SQLContext(sc)


literal_metadata = types.StructType([
types.StructField(
'name',
types.StringType(),
nullable=True,
metadata={'comment': 'From accounting system.'}
),
types.StructField(
'age',
types.IntegerType(),
nullable=True,
metadata={'comment': None}
),
])

literal_rdd = sc.parallelize([
['Bob', 34],
['Dan', 42],
])
print(literal_rdd.take(2))

failed_dataframe = sqlContext.createDataFrame(
literal_rdd,
literal_metadata,
)
{code}

This produces the following ~stacktrace:
{noformat}
Traceback (most recent call last):
  File "", line 1, in 
  File "", line 28, in 
  File 
"S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\context.py",
 line 408, in createDataFrame
jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File 
"S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py",
 line 538, in __call__
  File 
"S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\pyspark\sql\utils.py",
 line 36, in deco
return f(*a, **kw)
  File 
"S:\ZQL\Software\Hotware\spark-1.5.0-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py",
 line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
o757.applySchemaToPythonRDD.
: java.lang.RuntimeException: Do not support type class scala.Tuple2.
at 
org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:160)
at 
org.apache.spark.sql.types.Metadata$$anonfun$fromJObject$1.apply(Metadata.scala:127)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.types.Metadata$.fromJObject(Metadata.scala:127)
at 
org.apache.spark.sql.types.DataType$.org$apache$spark$sql$types$DataType$$parseStructField(DataType.scala:173)
at 
org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
at 
org.apache.spark.sql.types.DataType$$anonfun$parseDataType$1.apply(DataType.scala:148)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.types.DataType$.parseDataType(DataType.scala:148)
at org.apache.spark.sql.types.DataType$.fromJson(DataType.scala:96)
at org.apache.spark.sql.SQLContext.parseDataType(SQLContext.scala:961)
at 
org.apache.spark.sql.SQLContext.applySchemaToPythonRDD(SQLContext.scala:970)
at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)
{noformat}

I believe the most important line of the traceback is this one:
{noformat}
py4j.protocol.Py4JJavaError: An error occurred while calling 
o757.applySchemaToPythonRDD.
: java.lang.RuntimeException: Do not support type class scala.Tuple2.
{noformat}

But it wasn't enough for me to figure out the problem; I had to steadily 
simplify my program until I could identify what caused the problem.



[jira] [Updated] (SPARK-10850) WholeTextFileRDD only affect the first line in each partition

2015-09-27 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-10850:

Component/s: Spark Core

> WholeTextFileRDD only affect the first line in each partition
> -
>
> Key: SPARK-10850
> URL: https://issues.apache.org/jira/browse/SPARK-10850
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Fengdong Yu
>
> {code}
> val sparkConf = new SparkConf()
> val sc = new SparkContext(sparkConf)
>   
> val text = sc.wholeTextFiles("/test/*/", 3)
> text.map(x => x._1 + "^^^" + x._2).collect
> {code}
> output:
> {code}
> hdfs:///test/test1/1.data^^^hello1
> hello2
> hello3
> hdfs:///test/test2/2.data^^^hello1
> hello2
> hello3
> {code}
> I have two datasets under '/test/': /test/test1/1.data;  /test/test2/2.data
> each dataset has three lines: 
> hello1
> hello2
> hello3




[jira] [Assigned] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10688:


Assignee: Apache Spark

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.




[jira] [Commented] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909738#comment-14909738
 ] 

Apache Spark commented on SPARK-10688:
--

User 'vectorijk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8926

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.




[jira] [Assigned] (SPARK-10688) Python API for AFTSurvivalRegression

2015-09-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10688:


Assignee: (was: Apache Spark)

> Python API for AFTSurvivalRegression
> 
>
> Key: SPARK-10688
> URL: https://issues.apache.org/jira/browse/SPARK-10688
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Xiangrui Meng
>  Labels: starter
>
> After SPARK-10686, we should add Python API for AFTSurvivalRegression.




[jira] [Commented] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-27 Thread shimizu yoshihiro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909941#comment-14909941
 ] 

shimizu yoshihiro commented on SPARK-10778:
---

Thanks!

> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: shimizu yoshihiro
>Priority: Trivial
>  Labels: starter
> Fix For: 1.6.0
>
>
> pretty print for association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}
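
For reference, a sketch of that formatting ({{Rule}} exposes an antecedent, a 
consequent, and a confidence; the helper below is hypothetical):

{code}
// Hypothetical helper producing the pretty form shown above.
def ruleToString(antecedent: Seq[String], consequent: Seq[String], confidence: Double): String =
  antecedent.mkString("{", ", ", "}") + " => " + consequent.mkString("{", ", ", "}") + f": $confidence%.1f"

println(ruleToString(Seq("a", "b", "c"), Seq("d"), 0.8))  // {a, b, c} => {d}: 0.8
{code}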




[jira] [Created] (SPARK-10849) Allow user to specify database column type for data frame fields when writing data to jdbc data sources.

2015-09-27 Thread Suresh Thalamati (JIRA)
Suresh Thalamati created SPARK-10849:


 Summary: Allow user to specify database column type for data frame 
fields when writing data to jdbc data sources. 
 Key: SPARK-10849
 URL: https://issues.apache.org/jira/browse/SPARK-10849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: Suresh Thalamati
Priority: Minor


Mapping data frame field types to database column types is addressed to a large 
extent by adding dialects, and by the maxlength option added in SPARK-10101 to 
set the VARCHAR length.

In some cases it is hard to determine the maximum supported VARCHAR size. For 
example, DB2 z/OS VARCHAR size depends on the page size, and some databases also 
have row-size limits for VARCHAR. Defaulting all String columns to CLOB will 
likely make reads and writes slow.
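
For reference, the dialect mechanism mentioned above maps per data type, not per 
field; a sketch of what registering one looks like (the URL prefix and mapping 
are chosen only for illustration):

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Maps every StringType column to CLOB for matching URLs -- coarse-grained,
// which is why a per-field override is proposed below.
val clobDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("CLOB", java.sql.Types.CLOB))
    case _ => None
  }
}
JdbcDialects.registerDialect(clobDialect)
{code}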

Allowing users to specify the database column type for a data frame field will be 
useful when a user wants to fine-tune the mapping for one or two fields and is 
fine with the defaults for all other fields.

I propose to make the following two properties available for users to set in 
the data frame metadata when writing to JDBC data sources:
database.column.type  --  the column type to use in CREATE TABLE.
jdbc.column.type  --  the JDBC type to use when setting null values.

Example:
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val secdf = sc.parallelize(Array(("Apple", "Revenue ..."),
  ("Google", "Income:123213"))).toDF("name", "report")

val metadataBuilder = new MetadataBuilder()
metadataBuilder.putString("database.column.type", "CLOB(100K)")
metadataBuilder.putLong("jdbc.column.type", java.sql.Types.CLOB)
val metadata = metadataBuilder.build()
val secReportDF = secdf.withColumn("report", col("report").as("report", metadata))
// mysqlProps: a java.util.Properties holding the JDBC connection settings
secReportDF.write.jdbc("jdbc:mysql:///secdata", "reports", mysqlProps)
{code}





[jira] [Commented] (SPARK-7160) Support converting DataFrames to typed RDDs.

2015-09-27 Thread Ray Ortigas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909971#comment-14909971
 ] 

Ray Ortigas commented on SPARK-7160:


Hi [~marmbrus], I just updated the PR again based on your feedback, rebasing on 
a commit from 9/25.

> Support converting DataFrames to typed RDDs.
> 
>
> Key: SPARK-7160
> URL: https://issues.apache.org/jira/browse/SPARK-7160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: Ray Ortigas
>Assignee: Ray Ortigas
>Priority: Critical
>
> As a Spark user still working with RDDs, I'd like the ability to convert a 
> DataFrame to a typed RDD.
> For example, if I've converted RDDs to DataFrames so that I could save them 
> as Parquet or CSV files, I would like to rebuild the RDD from those files 
> automatically rather than writing the row-to-type conversion myself.
> {code}
> val rdd0 = sc.parallelize(Seq(Food("apple", 1), Food("banana", 2), 
> Food("cherry", 3)))
> val df0 = rdd0.toDF()
> df0.save("foods.parquet")
> val df1 = sqlContext.load("foods.parquet")
> val rdd1 = df1.toTypedRDD[Food]()
> // rdd0 and rdd1 should have the same elements
> {code}
> I originally submitted a smaller PR for spark-csv, but Reynold Xin suggested 
> that converting a DataFrame to a typed RDD wasn't something specific to 
> spark-csv.
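
For context, the row-to-type conversion the proposal would automate currently 
looks roughly like this (a sketch; {{Food}}'s field names and the column order 
are assumed from the example above):

{code}
case class Food(name: String, qty: Int)  // hypothetical shape matching Food("apple", 1)

// Manual conversion from the loaded DataFrame back to an RDD of Food.
val rdd1 = df1.rdd.map(row => Food(row.getString(0), row.getInt(1)))
{code}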




[jira] [Created] (SPARK-10850) wholeTextFileRDD only affect the first line in each partition

2015-09-27 Thread Fengdong Yu (JIRA)
Fengdong Yu created SPARK-10850:
---

 Summary: wholeTextFileRDD only affect the first line in each 
partition
 Key: SPARK-10850
 URL: https://issues.apache.org/jira/browse/SPARK-10850
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.1
Reporter: Fengdong Yu


{code}
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)

val text = sc.wholeTextFiles("/test/*/", 3)
text.map(x => x._1 + "^^^" + x._2).collect
{code}

output:
{code}
hdfs:///test/test1/1.data^^^hello1
hello2
hello3
hdfs:///test/test2/2.data^^^hello1
hello2
hello3
{code}

I have two datasets under '/test/': /test/test1/1.data;  /test/test2/2.data

each dataset has three lines: 
hello1
hello2
hello3
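
For reference, {{wholeTextFiles}} yields one (path, content) pair per file; a 
sketch that tags every line with its source path instead:

{code}
// Split each file's content and prefix every line with the file it came from.
val perLine = text.flatMap { case (path, content) =>
  content.split("\n").map(line => s"$path^^^$line")
}
perLine.collect.foreach(println)
{code}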







[jira] [Updated] (SPARK-10850) WholeTextFileRDD only affect the first line in each partition

2015-09-27 Thread Fengdong Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengdong Yu updated SPARK-10850:

Summary: WholeTextFileRDD only affect the first line in each partition  
(was: wholeTextFileRDD only affect the first line in each partition)

> WholeTextFileRDD only affect the first line in each partition
> -
>
> Key: SPARK-10850
> URL: https://issues.apache.org/jira/browse/SPARK-10850
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Fengdong Yu
>
> {code}
> val sparkConf = new SparkConf()
> val sc = new SparkContext(sparkConf)
>   
> val text = sc.wholeTextFiles("/test/*/", 3)
> text.map(x => x._1 + "^^^" + x._2).collect
> {code}
> output:
> {code}
> hdfs:///test/test1/1.data^^^hello1
> hello2
> hello3
> hdfs:///test/test2/2.data^^^hello1
> hello2
> hello3
> {code}
> I have two datasets under '/test/': /test/test1/1.data;  /test/test2/2.data
> each dataset has three lines: 
> hello1
> hello2
> hello3




[jira] [Updated] (SPARK-4066) Make whether maven builds fails on scalastyle violation configurable

2015-09-27 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-4066:
--
Description: 
Here is the thread Koert started:

http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bit=scalastyle+annoys+me+a+little+bit

It would be more flexible if whether the maven build fails on scalastyle 
violations were configurable.

  was:
Here is the thread Koert started:

http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bit=scalastyle+annoys+me+a+little+bit


It would be flexible if whether maven build fails due to scalastyle violation 
configurable.


> Make whether maven builds fails on scalastyle violation configurable
> 
>
> Key: SPARK-4066
> URL: https://issues.apache.org/jira/browse/SPARK-4066
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Ted Yu
>Priority: Minor
>  Labels: style
> Attachments: spark-4066-v1.txt
>
>
> Here is the thread Koert started:
> http://search-hadoop.com/m/JW1q5j8z422/scalastyle+annoys+me+a+little+bit=scalastyle+annoys+me+a+little+bit
> It would be more flexible if whether the maven build fails on scalastyle 
> violations were configurable.




[jira] [Commented] (SPARK-10849) Allow user to specify database column type for data frame fields when writing data to jdbc data sources.

2015-09-27 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909972#comment-14909972
 ] 

Suresh Thalamati commented on SPARK-10849:
--

I am working on creating a pull request for this issue. 

> Allow user to specify database column type for data frame fields when writing 
> data to jdbc data sources. 
> -
>
> Key: SPARK-10849
> URL: https://issues.apache.org/jira/browse/SPARK-10849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Suresh Thalamati
>Priority: Minor
>
> Mapping data frame field types to database column types is addressed to a large 
> extent by adding dialects, and by the maxlength option added in SPARK-10101 to 
> set the VARCHAR length.
> In some cases it is hard to determine the maximum supported VARCHAR size. For 
> example, DB2 z/OS VARCHAR size depends on the page size, and some databases 
> also have row-size limits for VARCHAR. Defaulting all String columns to CLOB 
> will likely make reads and writes slow.
> Allowing users to specify the database column type for a data frame field will 
> be useful when a user wants to fine-tune the mapping for one or two fields and 
> is fine with the defaults for all other fields.
> I propose to make the following two properties available for users to set in 
> the data frame metadata when writing to JDBC data sources:
> database.column.type  --  the column type to use in CREATE TABLE.
> jdbc.column.type  --  the JDBC type to use when setting null values.
> Example:
> {code}
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder
>
> val secdf = sc.parallelize(Array(("Apple", "Revenue ..."),
>   ("Google", "Income:123213"))).toDF("name", "report")
>
> val metadataBuilder = new MetadataBuilder()
> metadataBuilder.putString("database.column.type", "CLOB(100K)")
> metadataBuilder.putLong("jdbc.column.type", java.sql.Types.CLOB)
> val metadata = metadataBuilder.build()
> val secReportDF = secdf.withColumn("report", col("report").as("report", metadata))
> // mysqlProps: a java.util.Properties holding the JDBC connection settings
> secReportDF.write.jdbc("jdbc:mysql:///secdata", "reports", mysqlProps)
> {code}




[jira] [Comment Edited] (SPARK-10796) The Stage taskSets may are all removed while stage still have pending partitions after having lost some executors

2015-09-27 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909990#comment-14909990
 ] 

SuYan edited comment on SPARK-10796 at 9/28/15 2:59 AM:


Running Stage 0, running TaskSet0.0, Finshed task0.0 in ExecA,  running Task1.0 
in ExecB, waiting Task2.0
---> Task1.0 throws FetchFailedException
---> Running Resubmited stage 0, running TaskSet0.1(which re-run Task1, Task2), 
assume Task 1.0 finshed in ExecA
---> ExecA lost, and it happens no one throw FetchFailedExecption.
---> TaskSet0.1 re-submit task 1, re-add it into pendingTasks, and waiting 
TaskSchedulerImp schedule.
 TaskSet 0.0 also resubmit task0, re-add it into pendingTasks, because it‘s 
Zombie, TaskSchedulerImpl skip to schedule TaskSet0.0

So if TaskSet0.0 and TaskSet0.1 (isZombie && runningTasks.empty), 
TaskSchedulerImp will remove those TaskSets.
DagScheduler still have pendingPartitions due to the task lost in TaskSet0.0, 
but his TaskSets are all removed, so hangs







was (Author: suyan):

Running Stage 0, running TaskSet0.0, Finshed task0.0 in ExecA,  running Task1.0 
in ExecB, waiting Task2.0
---> Task1 throws FetchFailedException
---> Running Resubmited stage 0, running TaskSet0.1(which re-run Task1, Task2), 
assume Task 1.0 finshed in ExecA
---> ExecA lost, and it happens no one throw FetchFailedExecption.
---> TaskSet0.1 re-submit task 1, re-add it into pendingTasks, and waiting 
TaskSchedulerImp schedule.
 TaskSet 0.0 also resubmit task0, re-add it into pendingTasks, because it‘s 
Zombie, TaskSchedulerImpl skip to schedule TaskSet0.0

So if TaskSet0.0 and TaskSet0.1 (isZombie && runningTasks.empty), 
TaskSchedulerImp will remove those TaskSets.
DagScheduler still have pendingPartitions due to the task lost in TaskSet0.0, 
but his TaskSets are all removed, so hangs






> The Stage taskSets may are all removed while stage still have pending 
> partitions after having lost some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0
>Reporter: SuYan
>Priority: Minor
>
> We meet that problem in Spark 1.3.0, and I also check the latest Spark code, 
> and I think that problem still exist.
> 1. We know a running *ShuffleMapStage* will have multiple *TaskSet*: one 
> Active TaskSet, multiple Zombie TaskSet. 
> 2. We think a running *ShuffleMapStage* is success only if its partition are 
> all process success, namely each task‘s *MapStatus* are all add into 
> *outputLocs*
> 3. *MapStatus* of running *ShuffleMapStage* may succeed by Zombie TaskSet1 / 
> Zombie TaskSet2 // Active TaskSetN, and may some MapStatus only belong to 
> one TaskSet, and may be a Zombie TaskSet.
> 4. If lost a executor, it chanced that some lost-executor related *MapStatus* 
> are succeed by some Zombie TaskSet.  In current logical, The solution to 
> resolved that lost *MapStatus* problem is, each *TaskSet* re-running that 
> those tasks which succeed in lost-executor: re-add into *TaskSet's 
> pendingTasks*, and re-add it paritions into *Stage‘s pendingPartitions* . but 
> it is useless if that lost *MapStatus* only belong to *Zombie TaskSet*, it is 
> Zombie, so will never be scheduled his *pendingTasks*
> 5. The condition for resubmit stage is only if some task throws 
> *FetchFailedException*, but may the lost-executor just not empty any 
> *MapStatus* of parent Stage for one of running Stages, and it‘s happen to 
> that running Stage was lost a *MapStatus*  only belong to a *ZombieTask*. So 
> if all Zombie TaskSets are all processed his runningTasks and Active TaskSet 
> are all processed his pendingTask, then will removed by *TaskSchedulerImp*, 
> then that running Stage's *pending partitions* is still nonEmpty. it will 
> hangs..




[jira] [Commented] (SPARK-10796) The Stage taskSets may are all removed while stage still have pending partitions after having lost some executors

2015-09-27 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909990#comment-14909990
 ] 

SuYan commented on SPARK-10796:
---


Running Stage 0, running TaskSet0.0, Finshed task0.0 in ExecA,  running Task1.0 
in ExecB, waiting Task2.0
---> Task1 throws FetchFailedException
---> Running Resubmited stage 0, running TaskSet0.1(which re-run Task1, Task2), 
assume Task 1.0 finshed in ExecA
---> ExecA lost, and it happens no one throw FetchFailedExecption.
---> TaskSet0.1 re-submit task 1, re-add it into pendingTasks, and waiting 
TaskSchedulerImp schedule.
 TaskSet 0.0 also resubmit task0, re-add it into pendingTasks, because it‘s 
Zombie, TaskSchedulerImpl skip to schedule TaskSet0.0

So if TaskSet0.0 and TaskSet0.1 (isZombie && runningTasks.empty), 
TaskSchedulerImp will remove those TaskSets.
DagScheduler still have pendingPartitions due to the task lost in TaskSet0.0, 
but his TaskSets are all removed, so hangs






> The Stage taskSets may are all removed while stage still have pending 
> partitions after having lost some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0
>Reporter: SuYan
>Priority: Minor
>
> We meet that problem in Spark 1.3.0, and I also check the latest Spark code, 
> and I think that problem still exist.
> 1. We know a running *ShuffleMapStage* will have multiple *TaskSet*: one 
> Active TaskSet, multiple Zombie TaskSet. 
> 2. We think a running *ShuffleMapStage* is success only if its partition are 
> all process success, namely each task‘s *MapStatus* are all add into 
> *outputLocs*
> 3. *MapStatus* of running *ShuffleMapStage* may succeed by Zombie TaskSet1 / 
> Zombie TaskSet2 // Active TaskSetN, and may some MapStatus only belong to 
> one TaskSet, and may be a Zombie TaskSet.
> 4. If lost a executor, it chanced that some lost-executor related *MapStatus* 
> are succeed by some Zombie TaskSet.  In current logical, The solution to 
> resolved that lost *MapStatus* problem is, each *TaskSet* re-running that 
> those tasks which succeed in lost-executor: re-add into *TaskSet's 
> pendingTasks*, and re-add it paritions into *Stage‘s pendingPartitions* . but 
> it is useless if that lost *MapStatus* only belong to *Zombie TaskSet*, it is 
> Zombie, so will never be scheduled his *pendingTasks*
> 5. The condition for resubmit stage is only if some task throws 
> *FetchFailedException*, but may the lost-executor just not empty any 
> *MapStatus* of parent Stage for one of running Stages, and it‘s happen to 
> that running Stage was lost a *MapStatus*  only belong to a *ZombieTask*. So 
> if all Zombie TaskSets are all processed his runningTasks and Active TaskSet 
> are all processed his pendingTask, then will removed by *TaskSchedulerImp*, 
> then that running Stage's *pending partitions* is still nonEmpty. it will 
> hangs..




[jira] [Updated] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10829:
-
Priority: Critical  (was: Blocker)

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Critical
>
> To reproduce that with the code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
> val path = s"${dir.getCanonicalPath}/part=1"
> (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
> // If the "part = 1" filter gets pushed down, this query will throw 
> an exception since
> // "part" is not a valid column in the actual Parquet file
> checkAnswer(
>   sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 
> 1)"),
>   (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> We expect the result as:
> {code}
> 2, 1
> 3, 1
> {code}
> But we got:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}




[jira] [Updated] (SPARK-10829) Scan DataSource with predicate expression combine partition key and attributes doesn't work

2015-09-27 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10829:
-
Target Version/s: 1.5.2, 1.6.0

> Scan DataSource with predicate expression combine partition key and 
> attributes doesn't work
> ---
>
> Key: SPARK-10829
> URL: https://issues.apache.org/jira/browse/SPARK-10829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> To reproduce that with the code:
> {code}
> withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
> val path = s"${dir.getCanonicalPath}/part=1"
> (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
> // If the "part = 1" filter gets pushed down, this query will throw 
> an exception since
> // "part" is not a valid column in the actual Parquet file
> checkAnswer(
>   sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 
> 1)"),
>   (2 to 3).map(i => Row(i, i.toString, 1)))
>   }
> }
> {code}
> We expect the result as:
> {code}
> 2, 1
> 3, 1
> {code}
> But we got:
> {code}
> 1, 1
> 2, 1
> 3, 1
> {code}




[jira] [Comment Edited] (SPARK-10796) The Stage taskSets may are all removed while stage still have pending partitions after having lost some executors

2015-09-27 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909990#comment-14909990
 ] 

SuYan edited comment on SPARK-10796 at 9/28/15 3:00 AM:


Running Stage 0.0, running TaskSet0.0, Finshed task0.0 in ExecA,  running 
Task1.0 in ExecB, waiting Task2.0
---> Task1.0 throws FetchFailedException
---> Running Resubmited stage 0.1, running TaskSet0.1(which re-run Task1, 
Task2), assume Task 1.0 finshed in ExecA
---> ExecA lost, and it happens no one throw FetchFailedExecption.
---> TaskSet0.1 re-submit task 1, re-add it into pendingTasks, and waiting 
TaskSchedulerImp schedule.
 TaskSet 0.0 also resubmit task0, re-add it into pendingTasks, because it‘s 
Zombie, TaskSchedulerImpl skip to schedule TaskSet0.0

So if TaskSet0.0 and TaskSet0.1 (isZombie && runningTasks.empty), 
TaskSchedulerImp will remove those TaskSets.
DagScheduler still have pendingPartitions due to the task lost in TaskSet0.0, 
but his TaskSets are all removed, so hangs







was (Author: suyan):
Running Stage 0, running TaskSet0.0, Finshed task0.0 in ExecA,  running Task1.0 
in ExecB, waiting Task2.0
---> Task1.0 throws FetchFailedException
---> Running Resubmited stage 0, running TaskSet0.1(which re-run Task1, Task2), 
assume Task 1.0 finshed in ExecA
---> ExecA lost, and it happens no one throw FetchFailedExecption.
---> TaskSet0.1 re-submit task 1, re-add it into pendingTasks, and waiting 
TaskSchedulerImp schedule.
 TaskSet 0.0 also resubmit task0, re-add it into pendingTasks, because it‘s 
Zombie, TaskSchedulerImpl skip to schedule TaskSet0.0

So if TaskSet0.0 and TaskSet0.1 (isZombie && runningTasks.empty), 
TaskSchedulerImp will remove those TaskSets.
DagScheduler still have pendingPartitions due to the task lost in TaskSet0.0, 
but his TaskSets are all removed, so hangs






> The Stage taskSets may are all removed while stage still have pending 
> partitions after having lost some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0
>Reporter: SuYan
>Priority: Minor
>
> We meet that problem in Spark 1.3.0, and I also check the latest Spark code, 
> and I think that problem still exist.
> 1. We know a running *ShuffleMapStage* will have multiple *TaskSet*: one 
> Active TaskSet, multiple Zombie TaskSet. 
> 2. We think a running *ShuffleMapStage* is success only if its partition are 
> all process success, namely each task‘s *MapStatus* are all add into 
> *outputLocs*
> 3. *MapStatus* of running *ShuffleMapStage* may succeed by Zombie TaskSet1 / 
> Zombie TaskSet2 // Active TaskSetN, and may some MapStatus only belong to 
> one TaskSet, and may be a Zombie TaskSet.
> 4. If lost a executor, it chanced that some lost-executor related *MapStatus* 
> are succeed by some Zombie TaskSet.  In current logical, The solution to 
> resolved that lost *MapStatus* problem is, each *TaskSet* re-running that 
> those tasks which succeed in lost-executor: re-add into *TaskSet's 
> pendingTasks*, and re-add it paritions into *Stage‘s pendingPartitions* . but 
> it is useless if that lost *MapStatus* only belong to *Zombie TaskSet*, it is 
> Zombie, so will never be scheduled his *pendingTasks*
> 5. The condition for resubmit stage is only if some task throws 
> *FetchFailedException*, but may the lost-executor just not empty any 
> *MapStatus* of parent Stage for one of running Stages, and it‘s happen to 
> that running Stage was lost a *MapStatus*  only belong to a *ZombieTask*. So 
> if all Zombie TaskSets are all processed his runningTasks and Active TaskSet 
> are all processed his pendingTask, then will removed by *TaskSchedulerImp*, 
> then that running Stage's *pending partitions* is still nonEmpty. it will 
> hangs..




[jira] [Assigned] (SPARK-10796) The Stage taskSets may are all removed while stage still have pending partitions after having lost some executors

2015-09-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10796:


Assignee: (was: Apache Spark)

> The Stage taskSets may are all removed while stage still have pending 
> partitions after having lost some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0
>Reporter: SuYan
>Priority: Minor
>
> We hit this problem in Spark 1.3.0, and after checking the latest Spark code 
> we believe it still exists.
> 1. A running *ShuffleMapStage* can have multiple *TaskSet*s: one active 
> TaskSet and multiple zombie TaskSets.
> 2. A running *ShuffleMapStage* only succeeds once all of its partitions have 
> been processed successfully, i.e. every task's *MapStatus* has been added to 
> *outputLocs*.
> 3. The *MapStatus*es of a running *ShuffleMapStage* may be produced by zombie 
> TaskSet 1 / zombie TaskSet 2 / ... / the active TaskSet N, so some of them may 
> belong to a single TaskSet, and that TaskSet may be a zombie.
> 4. When an executor is lost, some of the *MapStatus*es tied to that executor 
> may have been produced by a zombie TaskSet. In the current logic, the fix for 
> a lost *MapStatus* is that each *TaskSet* re-runs the tasks that had 
> succeeded on the lost executor: they are re-added to the *TaskSet's 
> pendingTasks*, and their partitions are re-added to the *Stage's 
> pendingPartitions*. This does nothing if the lost *MapStatus* belongs only to 
> a *zombie TaskSet*, because a zombie TaskSet never schedules its 
> *pendingTasks* again.
> 5. A stage is only resubmitted when some task throws a 
> *FetchFailedException*. But the lost executor may not invalidate any 
> *MapStatus* of the parent stage of a running stage, while that running stage 
> happens to have lost a *MapStatus* of its own that belongs only to a *zombie 
> task*. Once every zombie TaskSet has finished its running tasks and the 
> active TaskSet has finished its pending tasks, they are all removed by 
> *TaskSchedulerImpl*, yet that running stage's *pendingPartitions* set is 
> still non-empty, so the job hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10796) The stage's TaskSets may all be removed while the stage still has pending partitions after losing some executors

2015-09-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10796:


Assignee: Apache Spark

> The stage's TaskSets may all be removed while the stage still has pending 
> partitions after losing some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0
>Reporter: SuYan
>Assignee: Apache Spark
>Priority: Minor
>
> We hit this problem in Spark 1.3.0, and after checking the latest Spark code 
> we believe it still exists.
> 1. A running *ShuffleMapStage* can have multiple *TaskSet*s: one active 
> TaskSet and multiple zombie TaskSets.
> 2. A running *ShuffleMapStage* only succeeds once all of its partitions have 
> been processed successfully, i.e. every task's *MapStatus* has been added to 
> *outputLocs*.
> 3. The *MapStatus*es of a running *ShuffleMapStage* may be produced by zombie 
> TaskSet 1 / zombie TaskSet 2 / ... / the active TaskSet N, so some of them may 
> belong to a single TaskSet, and that TaskSet may be a zombie.
> 4. When an executor is lost, some of the *MapStatus*es tied to that executor 
> may have been produced by a zombie TaskSet. In the current logic, the fix for 
> a lost *MapStatus* is that each *TaskSet* re-runs the tasks that had 
> succeeded on the lost executor: they are re-added to the *TaskSet's 
> pendingTasks*, and their partitions are re-added to the *Stage's 
> pendingPartitions*. This does nothing if the lost *MapStatus* belongs only to 
> a *zombie TaskSet*, because a zombie TaskSet never schedules its 
> *pendingTasks* again.
> 5. A stage is only resubmitted when some task throws a 
> *FetchFailedException*. But the lost executor may not invalidate any 
> *MapStatus* of the parent stage of a running stage, while that running stage 
> happens to have lost a *MapStatus* of its own that belongs only to a *zombie 
> task*. Once every zombie TaskSet has finished its running tasks and the 
> active TaskSet has finished its pending tasks, they are all removed by 
> *TaskSchedulerImpl*, yet that running stage's *pendingPartitions* set is 
> still non-empty, so the job hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10796) The stage's TaskSets may all be removed while the stage still has pending partitions after losing some executors

2015-09-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909996#comment-14909996
 ] 

Apache Spark commented on SPARK-10796:
--

User 'suyanNone' has created a pull request for this issue:
https://github.com/apache/spark/pull/8927

> The stage's TaskSets may all be removed while the stage still has pending 
> partitions after losing some executors
> -
>
> Key: SPARK-10796
> URL: https://issues.apache.org/jira/browse/SPARK-10796
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.3.0
>Reporter: SuYan
>Priority: Minor
>
> We hit this problem in Spark 1.3.0, and after checking the latest Spark code 
> we believe it still exists.
> 1. A running *ShuffleMapStage* can have multiple *TaskSet*s: one active 
> TaskSet and multiple zombie TaskSets.
> 2. A running *ShuffleMapStage* only succeeds once all of its partitions have 
> been processed successfully, i.e. every task's *MapStatus* has been added to 
> *outputLocs*.
> 3. The *MapStatus*es of a running *ShuffleMapStage* may be produced by zombie 
> TaskSet 1 / zombie TaskSet 2 / ... / the active TaskSet N, so some of them may 
> belong to a single TaskSet, and that TaskSet may be a zombie.
> 4. When an executor is lost, some of the *MapStatus*es tied to that executor 
> may have been produced by a zombie TaskSet. In the current logic, the fix for 
> a lost *MapStatus* is that each *TaskSet* re-runs the tasks that had 
> succeeded on the lost executor: they are re-added to the *TaskSet's 
> pendingTasks*, and their partitions are re-added to the *Stage's 
> pendingPartitions*. This does nothing if the lost *MapStatus* belongs only to 
> a *zombie TaskSet*, because a zombie TaskSet never schedules its 
> *pendingTasks* again.
> 5. A stage is only resubmitted when some task throws a 
> *FetchFailedException*. But the lost executor may not invalidate any 
> *MapStatus* of the parent stage of a running stage, while that running stage 
> happens to have lost a *MapStatus* of its own that belongs only to a *zombie 
> task*. Once every zombie TaskSet has finished its running tasks and the 
> active TaskSet has finished its pending tasks, they are all removed by 
> *TaskSchedulerImpl*, yet that running stage's *pendingPartitions* set is 
> still non-empty, so the job hangs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available

2015-09-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909726#comment-14909726
 ] 

Sean Owen commented on SPARK-10644:
---

What is your memory config -- just to double-check? As you say, it's strange, 
so it's worth chasing down every lead.

> Applications wait even if free executors are available
> --
>
> Key: SPARK-10644
> URL: https://issues.apache.org/jira/browse/SPARK-10644
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
> Environment: RHEL 6.5 64 bit
>Reporter: Balagopal Nair
>Priority: Minor
>
> Number of workers: 21
> Number of executors: 63
> Steps to reproduce:
> 1. Run 4 jobs, each with max cores set to 10.
> 2. The first 3 jobs run with 10 executors each (30 executors consumed so far).
> 3. The 4th job waits even though there are 33 idle executors.
> The reason is that a job will not get executors unless 
> the total number of EXECUTORS in use < the number of WORKERS.
> If there are executors available, resources should be allocated to the 
> pending job.
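
As a quick sanity check of the arithmetic above, here is a toy Scala snippet (not Spark's scheduler code) that evaluates the gate described in the report, using the reported numbers:

{code}
object SchedulingGateSketch {
  def main(args: Array[String]): Unit = {
    val workers = 21
    val executorsInUse = 3 * 10 // the first three jobs hold 10 executors each
    // The reported condition: a waiting job only receives executors while
    // the number of executors in use is below the number of workers.
    val fourthJobGetsExecutors = executorsInUse < workers
    // Prints: executors in use = 30, workers = 21, 4th job scheduled = false
    println(s"executors in use = $executorsInUse, workers = $workers, " +
      s"4th job scheduled = $fourthJobGetsExecutors")
  }
}
{code}

So even with 33 executors idle, 30 in use already exceeds the 21 workers, and the fourth job waits.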



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available

2015-09-27 Thread Balagopal Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14909758#comment-14909758
 ] 

Balagopal Nair commented on SPARK-10644:


That's true.
My memory config is 512m per executor.
Each machine has 6.7G of available RAM.

> Applications wait even if free executors are available
> --
>
> Key: SPARK-10644
> URL: https://issues.apache.org/jira/browse/SPARK-10644
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
> Environment: RHEL 6.5 64 bit
>Reporter: Balagopal Nair
>Priority: Minor
>
> Number of workers: 21
> Number of executors: 63
> Steps to reproduce:
> 1. Run 4 jobs, each with max cores set to 10.
> 2. The first 3 jobs run with 10 executors each (30 executors consumed so far).
> 3. The 4th job waits even though there are 33 idle executors.
> The reason is that a job will not get executors unless 
> the total number of EXECUTORS in use < the number of WORKERS.
> If there are executors available, resources should be allocated to the 
> pending job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10778) Implement toString for AssociationRules.Rule

2015-09-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10778.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8904
[https://github.com/apache/spark/pull/8904]

> Implement toString for AssociationRules.Rule
> 
>
> Key: SPARK-10778
> URL: https://issues.apache.org/jira/browse/SPARK-10778
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: shimizu yoshihiro
>Priority: Trivial
>  Labels: starter
> Fix For: 1.6.0
>
>
> Pretty-print association rules, e.g.
> {code}
> {a, b, c} => {d}: 0.8
> {code}
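
For a concrete picture of such a formatting, here is a minimal Scala sketch. Rule is a hypothetical stand-in, not Spark's actual AssociationRules.Rule; the real field names and output format are whatever the merged pull request defines.

{code}
// Hypothetical stand-in for AssociationRules.Rule, for illustration only.
case class Rule[Item](antecedent: Array[Item], consequent: Array[Item], confidence: Double) {
  override def toString: String =
    s"${antecedent.mkString("{", ", ", "}")} => ${consequent.mkString("{", ", ", "}")}: $confidence"
}

object RuleToStringDemo {
  def main(args: Array[String]): Unit = {
    // Prints: {a, b, c} => {d}: 0.8
    println(Rule(Array("a", "b", "c"), Array("d"), 0.8))
  }
}
{code}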



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10597) MultivariateOnlineSummarizer for weighted instances

2015-09-27 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-10597:

Assignee: DB Tsai

> MultivariateOnlineSummarizer for weighted instances
> ---
>
> Key: SPARK-10597
> URL: https://issues.apache.org/jira/browse/SPARK-10597
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> MultivariateOnlineSummarizer for weighted instances is implemented as a 
> private API for SPARK-7685.
> In SPARK-7685, the online, numerically stable version of the unbiased 
> variance estimate defined by the reliability weights 
> ([[https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights]])
>  is implemented, but we would like to make it a public API since there are 
> different use cases.
> Currently, `count` returns the actual number of instances and ignores 
> instance weights, while `numNonzeros` returns the weighted number of nonzeros. 
> We need to decide on their behavior before making the API public.
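
For reference, here is a single-column Scala sketch of the reliability-weights estimator linked above. WeightedSummarizer is an illustrative toy, not Spark's MultivariateOnlineSummarizer API; its `count` mirrors the current unweighted-count behavior mentioned in the description.

{code}
// Illustrative single-column summarizer; NOT Spark's MultivariateOnlineSummarizer API.
class WeightedSummarizer {
  private var weightSum = 0.0       // V1: sum of weights
  private var weightSquareSum = 0.0 // V2: sum of squared weights
  private var mean = 0.0
  private var m2 = 0.0              // weighted sum of squared deviations from the running mean
  private var rawCount = 0L         // number of instances, ignoring weights

  def add(x: Double, weight: Double): this.type = {
    rawCount += 1
    weightSum += weight
    weightSquareSum += weight * weight
    val delta = x - mean
    mean += (weight / weightSum) * delta // numerically stable weighted-mean update
    m2 += weight * delta * (x - mean)
    this
  }

  def weightedMean: Double = mean
  def count: Long = rawCount // actual number of instances, as `count` behaves today

  // Unbiased variance under reliability weights: m2 / (V1 - V2 / V1)
  def variance: Double = m2 / (weightSum - weightSquareSum / weightSum)
}

object WeightedSummarizerDemo {
  def main(args: Array[String]): Unit = {
    val s = new WeightedSummarizer
    Seq((1.0, 1.0), (2.0, 2.0), (3.0, 1.0)).foreach { case (x, w) => s.add(x, w) }
    // Prints: mean = 2.0000, variance = 0.8000, count = 3
    println(f"mean = ${s.weightedMean}%.4f, variance = ${s.variance}%.4f, count = ${s.count}")
  }
}
{code}

The open question in the description is visible in the sketch: `count` ignores weights, while a weighted statistic such as `numNonzeros` would accumulate weights instead of incrementing a raw counter.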



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org