Re: using JavaRDD in spark-redis connector

2015-10-27 Thread Rohith P
Got it... thank you.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/using-JavaRDD-in-spark-redis-connector-tp14391p14812.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Task not serializable exception

2015-10-27 Thread Rohith Parameshwara
I am getting a "Task not serializable" exception when running spark-submit in
standalone mode. I am using Spark Streaming with a stream backed by Kafka
queues, but it is not able to process the mapping actions on the RDDs from the
stream. The code where the serialization exception occurs is as follows.
I have a separate class to manage the contexts, which has the respective
getters and setters:
public class Contexts {
    public RedisContext rc = null;
    public SparkContext sc = null;
    public Gson serializer = new Gson();
    public SparkConf sparkConf = null;       // new SparkConf().setAppName("SparkStreamEventProcessingEngine");
    public JavaStreamingContext jssc = null; // new JavaStreamingContext(sparkConf, new Duration(2000));
    public Producer kafkaProducer = null;    // generic type parameters were stripped by the mail archive
    public Tuple2 hostTup = null;            // generic type parameters were stripped by the mail archive
    // ... respective getters and setters ...
}



The class with the main process logic of spark streaming is as follows:
public final class SparkStreamEventProcessingEngine {
    public Contexts contexts = new Contexts();

    public SparkStreamEventProcessingEngine() {
    }

    public static void main(String[] args) {
        SparkStreamEventProcessingEngine temp = new SparkStreamEventProcessingEngine();
        temp.tempfunc();
    }

    private void tempfunc() {
        System.out.println(contexts.getJssc().toString() + "\n"
                + contexts.getRc().toString() + "\n" + contexts.getSc().toString() + "\n");
        createRewardProducer();
        Properties props = new Properties();
        try {
            props.load(SparkStreamEventProcessingEngine.class.getResourceAsStream("/application.properties"));
        } catch (IOException e) {
            System.out.println("Error loading application.properties file");
            return;
        }

        // Generic type parameters below were stripped by the mail archive and have been restored.
        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        topicMap.put(props.getProperty("kafa.inbound.queue"), 1);
        JavaPairReceiverInputDStream<String, String> messages =
                KafkaUtils.createStream(contexts.getJssc(),
                        props.getProperty("kafka.zookeeper.quorum"),
                        props.getProperty("kafka.consumer.group"), topicMap);

        // The exception occurs at this line..
        JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
            // private static final long serialVersionUID = 1L;

            public String call(Tuple2<String, String> tuple2) {
                return tuple2._2();
            }
        });

        lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
            public Void call(JavaRDD<String> rdd) throws Exception {
                rdd.foreach(new VoidFunction<String>() {
                    public void call(String stringData) throws Exception {
                        Gson serializer = new Gson();
                        OfferRedeemed event = serializer.fromJson(stringData, OfferRedeemed.class);
                        System.out.println("Incoming Event:" + event.toString());
                        processTactic(event, "51367");
                        processTactic(event, "53740");
                    }
                });
                return null;
            }
        });

        contexts.getJssc().start();
        contexts.getJssc().awaitTermination();
    }

    private void processTactic(OfferRedeemed event, String tacticId) {
        System.out.println(contexts.getRc().toString() + "hi4");

        TacticDefinition tactic = readTacticDefinition(tacticId);
        boolean conditionMet = false;
        if (tactic != null) {
            System.out.println("Evaluating event of type :" + event.getEventType()
                    + " for Tactic : " + tactic.toString());

And so on, for the respective functionalities...

The exception thrown is as follows:


Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:681)
at 
org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:258)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
at 
org.apache.spark.streaming.api.java.JavaDStreamLike$class.map(JavaDStreamLike.scala:157)
at 
org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.map(JavaDStreamLike.scala:43)
at 
com.coupons.stream.processing.
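
For reference, a minimal, self-contained Scala sketch of the usual cause of
this exception: a closure that references a field of an enclosing,
non-serializable object forces Spark to serialize the whole enclosing object.
The class and field names below are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// `Holder` is not Serializable and keeps a SparkContext, so any closure that
// drags `this` along will fail with "Task not serializable".
class Holder(sc: SparkContext) {
  val prefix: String = "event-"

  def broken(): Unit = {
    val rdd = sc.parallelize(1 to 10)
    // `prefix` is really `this.prefix`, so the whole Holder is pulled into the closure.
    rdd.map(i => prefix + i).collect()
  }

  def fixed(): Unit = {
    val rdd = sc.parallelize(1 to 10)
    val localPrefix = prefix // copy to a local val; only the String is captured
    rdd.map(i => localPrefix + i).collect()
  }
}

object ClosureCaptureDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("closure-capture-demo").setMaster("local[2]"))
    val holder = new Holder(sc)
    holder.fixed()  // works
    holder.broken() // throws org.apache.spark.SparkException: Task not serializable
    sc.stop()
  }
}

The usual fixes are to copy the needed fields into local variables before the
closure (as in fixed() above), or to move the function into a small standalone
class that implements Serializable and does not reference the driver-side
contexts.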

Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Yup, avg works well. So we have alternate functions to use in place of the
functions pointed out earlier. But my point is: are those original aggregate
functions not supposed to be used, am I using them in the wrong way, or is it
a bug, as I asked in my first mail?

On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu  wrote:

> Have you tried using avg in place of mean ?
>
> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
> s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
> sqlContext.sql("""
> CREATE TEMPORARY TABLE partitionedParquet
> USING org.apache.spark.sql.parquet
> OPTIONS (
>   path '/tmp/partitioned'
> )""")
> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>
> Cheers
>
> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani  > wrote:
>
>> So I tried @Reynold's suggestion. I could get countDistinct and
>> sumDistinct running but  mean and approxCountDistinct do not work. (I
>> guess I am using the wrong syntax for approxCountDistinct) For mean, I
>> think the registry entry is missing. Can someone clarify that as well?
>>
>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani > > wrote:
>>
>>> Will try in a while when I get back. I assume this applies to all
>>> functions other than mean. Also countDistinct is defined along with all
>>> other SQL functions. So I don't get "distinct is not part of function name"
>>> part.
>>> On 27 Oct 2015 19:58, "Reynold Xin"  wrote:
>>>
 Try

 count(distinct columnane)

 In SQL distinct is not part of the function name.

 On Tuesday, October 27, 2015, Shagun Sodhani 
 wrote:

> Oops seems I made a mistake. The error message is : Exception in
> thread "main" org.apache.spark.sql.AnalysisException: undefined function
> countDistinct
> On 27 Oct 2015 15:49, "Shagun Sodhani" 
> wrote:
>
>> Hi! I was trying out some aggregate  functions in SparkSql and I
>> noticed that certain aggregate operators are not working. This includes:
>>
>> approxCountDistinct
>> countDistinct
>> mean
>> sumDistinct
>>
>> For example using countDistinct results in an error saying
>> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> undefined function cosh;*
>>
>> I had a similar issue with cosh operator
>> 
>> as well some time back and it turned out that it was not registered in 
>> the
>> registry:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>
>>
>> *I* *think it is the same issue again and would be glad to send over
>> a PR if someone can confirm if this is an actual bug and not some mistake
>> on my part.*
>>
>>
>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>> Spark Version: 10.4
>> SparkSql Version: 1.5.1
>>
>> I am using the standard example of (name, age) schema (though I am
>> setting age as Double and not Int as I am trying out maths functions).
>>
>> The entire error stack can be found here
>> .
>>
>> Thanks!
>>
>
>>
>
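
For reference, a minimal Scala sketch of the spellings discussed in this
thread, run against a Spark 1.5.x SQLContext; the table and column names are
made up.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, countDistinct}
import org.apache.spark.{SparkConf, SparkContext}

object AggregateSyntaxDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("agg-syntax").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = Seq(("a", 30.0), ("b", 30.0), ("c", 25.0)).toDF("name", "age")
    people.registerTempTable("people")

    // In SQL, DISTINCT is a modifier inside the call, not part of the function name.
    sqlContext.sql("SELECT count(DISTINCT age) AS data FROM people").show()
    // avg(...) as an alternative spelling for the mean.
    sqlContext.sql("SELECT avg(age) FROM people").show()

    // The DataFrame API does expose countDistinct and avg as functions.
    people.agg(countDistinct($"age"), avg($"age")).show()

    sc.stop()
  }
}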


Filter applied on merged Parquet schemas with new column fails.

2015-10-27 Thread Hyukjin Kwon
When schema merging (mergeSchema) and predicate filtering are enabled
together, this fails because Parquet filters are pushed down regardless of the
schema of each split (or rather, each file).

Dominic Ricard reported this issue (
https://issues.apache.org/jira/browse/SPARK-11103)

Even though this works okay when spark.sql.parquet.filterPushdown is set to
false, its default value is true, so this looks like an issue.

My questions are:
is this clearly an issue?
and if so, which way should it be handled?


I think this is an issue, so I made three rough patches for it and tested
them; they look fine.

The first approach looks simpler and more appropriate, judging from previous
changes such as https://issues.apache.org/jira/browse/SPARK-11153

However, in terms of safety and performance, I would like to confirm which one
is the proper approach before opening a PR.

1. Simply set spark.sql.parquet.filterPushdown to false when mergeSchema is
used (a configuration sketch follows this list).

2. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of all
the part-files (and also the merged one), check whether each of them is
compatible with the given filter, and push the filter down only when they all
are. I think this is a bit over-engineered.

3. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of all
the part-files (and also the merged one) and apply the filter only to the
splits (rather, files) that can accept it, which (I think it is hacky) ends up
with different configurations for each task in a job.
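
For context, a small Scala sketch (hypothetical path and column name, using
the Spark 1.5-era DataFrame reader API) of the workaround in option 1: reading
with schema merging while predicate pushdown is disabled.

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object MergedSchemaWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merged-schema").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Option 1: evaluate filters in Spark instead of handing them to Parquet
    // readers whose files may not contain the new column.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

    val df = sqlContext.read
      .option("mergeSchema", "true")      // merge the per-file schemas
      .parquet("/tmp/partitioned")        // hypothetical path
    df.filter(df("newColumn") > 0).show() // hypothetical column present only in newer files

    sc.stop()
  }
}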


Re: Spark Implementation of XGBoost

2015-10-27 Thread Meihua Wu
Hi DB Tsai,

Thank you again for your insightful comments!

1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?

2) For L2/L1 regularization vs. learning rate (I use this name instead of
shrinkage to avoid confusion), I have the following observations:

Suppose G and H are the sum (over the data assigned to a leaf node) of
the 1st and 2nd derivative of the loss evaluated at f_m, respectively.
Then for this leaf node,

* With a learning rate eta: f_{m+1} = f_m - (G/H) * eta

* With an L2 regularization coefficient lambda: f_{m+1} = f_m - G/(H + lambda)

If H > 0 (convex loss), both approaches lead to "shrinkage":

* For the learning rate approach, the percentage of shrinkage is
uniform for any leaf node.

* For L2 regularization, the percentage of shrinkage would adapt to
the number of instances assigned to a leaf node: more instances =>
larger G and H => less shrinkage. This behavior is intuitive to me. If
the value estimated from this node is based on a large amount of data,
the value should be reliable and less shrinkage is needed.

I suppose we could have something similar for L1.
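
To make the comparison concrete, here is a tiny standalone Scala sketch of the
two leaf-value updates above (this is not code from SparkXGBoost; G, H, eta,
and lambda are as defined above, and the numbers in main are made up).

object LeafUpdate {
  // Learning-rate (shrinkage) update: the same percentage of shrinkage for every leaf.
  def withLearningRate(fm: Double, g: Double, h: Double, eta: Double): Double =
    fm - (g / h) * eta

  // L2-regularized update: shrinkage adapts to the amount of data in the leaf,
  // because G and H grow with the number of instances relative to lambda.
  def withL2(fm: Double, g: Double, h: Double, lambda: Double): Double =
    fm - g / (h + lambda)

  def main(args: Array[String]): Unit = {
    println(withL2(0.0, g = 2.0, h = 4.0, lambda = 10.0))       // small leaf: heavily shrunk
    println(withL2(0.0, g = 200.0, h = 400.0, lambda = 10.0))   // large leaf: barely shrunk
    println(withLearningRate(0.0, g = 2.0, h = 4.0, eta = 0.1)) // uniform shrinkage, leaf size irrelevant
  }
}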

I am not aware of theoretical results to conclude which method is
better. Likely to be dependent on the data at hand. Implementing
learning rate is on my radar for version 0.2. I should be able to add
it in a week or so. I will send you a note once it is done.

Thanks,

Meihua

On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai  wrote:
> Hi Meihua,
>
> For categorical features, the ordinal issue can be solved by trying
> all kind of different partitions 2^(q-1) -1 for q values into two
> groups. However, it's computational expensive. In Hastie's book, in
> 9.2.4, the trees can be trained by sorting the residuals and being
> learnt as if they are ordered. It can be proven that it will give the
> optimal solution. I have a proof that this works for learning
> regression trees through variance reduction.
>
> I'm also interested in understanding how the L1 and L2 regularization
> within the boosting works (and if it helps with overfitting more than
> shrinkage).
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu  
> wrote:
>> Hi DB Tsai,
>>
>> Thank you very much for your interest and comment.
>>
>> 1) feature sub-sample is per-node, like random forest.
>>
>> 2) The current code heavily exploits the tree structure to speed up
>> the learning (such as processing multiple learning node in one pass of
>> the training data). So a generic GBM is likely to be a different
>> codebase. Do you have any nice reference of efficient GBM? I am more
>> than happy to look into that.
>>
>> 3) The algorithm accept training data as a DataFrame with the
>> featureCol indexed by VectorIndexer. You can specify which variable is
>> categorical in the VectorIndexer. Please note that currently all
>> categorical variables are treated as ordered. If you want some
>> categorical variables as unordered, you can pass the data through
>> OneHotEncoder before the VectorIndexer. I do have a plan to handle
>> unordered categorical variable using the approach in RF in Spark ML
>> (Please see roadmap in the README.md)
>>
>> Thanks,
>>
>> Meihua
>>
>>
>>
>> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
>>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>>> you think you can implement generic GBM and have it merged as part of
>>> Spark codebase?
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>>  wrote:
 Hi Spark User/Dev,

 Inspired by the success of XGBoost, I have created a Spark package for
 gradient boosting tree with 2nd order approximation of arbitrary
 user-defined loss functions.

 https://github.com/rotationsymmetry/SparkXGBoost

 Currently linear (normal) regression, binary classification, Poisson
 regression are supported. You can extend with other loss function as
 well.

 L1, L2, bagging, feature sub-sampling are also employed to avoid 
 overfitting.

 Thank you for testing. I am looking forward to your comments and
 suggestions. Bugs or improvements can be reported through GitHub.

 Many thanks!

 Meihua

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Exception when using some aggregate operators

2015-10-27 Thread Ted Yu
Have you tried using avg in place of mean ?

(1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")
sqlContext.sql("""select avg(a) from partitionedParquet""").show()

Cheers

On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani 
wrote:

> So I tried @Reynold's suggestion. I could get countDistinct and
> sumDistinct running but  mean and approxCountDistinct do not work. (I
> guess I am using the wrong syntax for approxCountDistinct) For mean, I
> think the registry entry is missing. Can someone clarify that as well?
>
> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani 
> wrote:
>
>> Will try in a while when I get back. I assume this applies to all
>> functions other than mean. Also countDistinct is defined along with all
>> other SQL functions. So I don't get "distinct is not part of function name"
>> part.
>> On 27 Oct 2015 19:58, "Reynold Xin"  wrote:
>>
>>> Try
>>>
>>> count(distinct columnane)
>>>
>>> In SQL distinct is not part of the function name.
>>>
>>> On Tuesday, October 27, 2015, Shagun Sodhani 
>>> wrote:
>>>
 Oops seems I made a mistake. The error message is : Exception in thread
 "main" org.apache.spark.sql.AnalysisException: undefined function
 countDistinct
 On 27 Oct 2015 15:49, "Shagun Sodhani" 
 wrote:

> Hi! I was trying out some aggregate  functions in SparkSql and I
> noticed that certain aggregate operators are not working. This includes:
>
> approxCountDistinct
> countDistinct
> mean
> sumDistinct
>
> For example using countDistinct results in an error saying
> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
> undefined function cosh;*
>
> I had a similar issue with cosh operator
> 
> as well some time back and it turned out that it was not registered in the
> registry:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>
>
> *I* *think it is the same issue again and would be glad to send over
> a PR if someone can confirm if this is an actual bug and not some mistake
> on my part.*
>
>
> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
> Spark Version: 10.4
> SparkSql Version: 1.5.1
>
> I am using the standard example of (name, age) schema (though I am
> setting age as Double and not Int as I am trying out maths functions).
>
> The entire error stack can be found here
> .
>
> Thanks!
>

>


Re: Spark.Executor.Cores question

2015-10-27 Thread Richard Marscher
Ah I see, that's a bit more complicated =). If it's possible, would using
`spark.executor.memory` to set the worker memory available to executors help
alleviate the problem of running on a node that already has an executor on it?
I would assume that would have a constant worst-case overhead per worker and
shouldn't matter whether two executors on the same node are from one
application or from two applications. Maybe you have data that states
otherwise? I suppose it depends on which resources are causing problems when
the executors are imbalanced across the cluster. That could be an indication
of not leaving enough free RAM on the worker nodes outside the worker and
executor heap allocations.

On Tue, Oct 27, 2015 at 4:57 PM, mkhaitman  wrote:

> Hi Richard,
>
> Thanks for the response.
>
> I should have added that the specific case where this becomes a problem is
> when one of the executors for that application is lost/killed prematurely,
> and the application attempts to spawn up a new executor without
> consideration as to whether an executor already exists on the other node.
>
> In your example, if one of the executors dies for some reason (memory
> exhaustion, or something crashed it), if there are still free cores on the
> other nodes, it will spawn an extra executor, which can lead to further
> memory problems on the other node that it just spawned on.
>
> Hopefully that clears up what I mean :)
>
> Mark.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Executor-Cores-question-tp14763p14805.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Re: Spark.Executor.Cores question

2015-10-27 Thread mkhaitman
Hi Richard,

Thanks for the response. 

I should have added that the specific case where this becomes a problem is
when one of the executors for that application is lost/killed prematurely,
and the application attempts to spawn up a new executor without
consideration as to whether an executor already exists on the other node. 

In your example, if one of the executors dies for some reason (memory
exhaustion, or something crashed it), if there are still free cores on the
other nodes, it will spawn an extra executor, which can lead to further
memory problems on the other node that it just spawned on. 

Hopefully that clears up what I mean :) 

Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Executor-Cores-question-tp14763p14805.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spark.Executor.Cores question

2015-10-27 Thread Richard Marscher
Hi Mark,

If you know your cluster's number of workers and cores per worker, you can
set this up when you create a SparkContext and shouldn't need to tinker with
the 'spark.executor.cores' setting. That setting is for running multiple
executors per application per worker, which you are saying you don't want.

How do you do what I'm describing? In standalone mode, cores are assigned in
round-robin order across the cluster's available workers (someone correct me
if that has changed since 1.3). So if you have 4 workers and set
`spark.cores.max` to `16` on your SparkContext, then you will have 4
executors, one on each worker, each using 4 cores. If you set
`spark.cores.max` to `6`, then two executors would have 2 cores and two
executors would have 1 core.

Hope that helps
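
For concreteness, a minimal sketch (hypothetical master URL and numbers, not
taken from this thread) of the standalone sizing described above, combining
`spark.cores.max` with the `spark.executor.memory` setting mentioned earlier
in the thread: with 4 workers and spark.cores.max set to 16, round-robin
assignment yields one 4-core executor per worker.

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneSizing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sizing-example")
      .setMaster("spark://master:7077")   // hypothetical standalone master
      .set("spark.cores.max", "16")       // total cores granted to this application
      .set("spark.executor.memory", "4g") // per-executor heap
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}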

On Fri, Oct 23, 2015 at 3:05 PM, mkhaitman  wrote:

> Regarding the 'spark.executor.cores' config option in a Standalone spark
> environment, I'm curious about whether there's a way to enforce the
> following logic:
>
> *- Max cores per executor = 4*
> ** Max executors PER application PER worker = 1*
>
> In order to force better balance across all workers, I want to ensure that
> a
> single spark job can only ever use a specific upper limit on the number of
> cores for each executor it holds, however, do not want a situation where it
> can spawn 3 executors on a worker and only 1/2 on the others. Some spark
> jobs end up using much more memory during aggregation tasks (joins /
> groupBy's) which is more heavily impacted by the number of cores per
> executor for that job.
>
> If this kind of setup/configuration doesn't already exist for Spark, and
> others see the benefit of what I mean by this, where would be the best
> location to insert this logic?
>
> Mark.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Executor-Cores-question-tp14763.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Pickle Spark DataFrame

2015-10-27 Thread agg212
Hi, I'd like to "pickle" a Spark DataFrame object and have tried the
following:

import pickle
data = sparkContext.jsonFile(data_file)  # load file
with open('out.pickle', 'wb') as handle:
    pickle.dump(data, handle)

If I convert "data" to a Pandas DataFrame (e.g., using data.toPandas()), the
above code works.  Does anybody have any idea how to do this?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Pickle-Spark-DataFrame-tp14803.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Sjoerd Mulder
No, the job doesn't actually fail, but since our tests were generating all
these stack traces I have disabled Tungsten mode just to be sure (and so we
don't have a gazillion stack traces in production).

2015-10-27 20:59 GMT+01:00 Josh Rosen :

> Hi Sjoerd,
>
> Did your job actually *fail* or did it just generate many spurious
> exceptions? While the stacktrace that you posted does indicate a bug, I
> don't think that it should have stopped query execution because Spark
> should have fallen back to an interpreted code path (note the "Failed to
> generate ordering, fallback to interpreted" in the error message).
>
> On Tue, Oct 27, 2015 at 12:56 PM Sjoerd Mulder 
> wrote:
>
>> I have disabled it because of it started generating ERROR's when
>> upgrading from Spark 1.4 to 1.5.1
>>
>> 2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to
>> generate ordering, fallback to interpreted
>> java.util.concurrent.ExecutionException: java.lang.Exception: failed to
>> compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9:
>> Invalid character input "@" (character code 64)
>>
>> public SpecificOrdering
>> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>>   return new SpecificOrdering(expr);
>> }
>>
>> class SpecificOrdering extends
>> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
>>
>>   private org.apache.spark.sql.catalyst.expressions.Expression[]
>> expressions;
>>
>>
>>
>>   public
>> SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[]
>> expr) {
>> expressions = expr;
>>
>>   }
>>
>>   @Override
>>   public int compare(InternalRow a, InternalRow b) {
>> InternalRow i = null;  // Holds current row being evaluated.
>>
>> i = a;
>> boolean isNullA2;
>> long primitiveA3;
>> {
>>   /* input[2, LongType] */
>>
>>   boolean isNull0 = i.isNullAt(2);
>>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>>
>>   isNullA2 = isNull0;
>>   primitiveA3 = primitive1;
>> }
>> i = b;
>> boolean isNullB4;
>> long primitiveB5;
>> {
>>   /* input[2, LongType] */
>>
>>   boolean isNull0 = i.isNullAt(2);
>>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>>
>>   isNullB4 = isNull0;
>>   primitiveB5 = primitive1;
>> }
>> if (isNullA2 && isNullB4) {
>>   // Nothing
>> } else if (isNullA2) {
>>   return 1;
>> } else if (isNullB4) {
>>   return -1;
>> } else {
>>   int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 <
>> primitiveB5 ? -1 : 0);
>>   if (comp != 0) {
>> return -comp;
>>   }
>> }
>>
>> return 0;
>>   }
>> }
>>
>> at
>> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>> at
>> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>> at
>> org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>> at
>> org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>> at
>> org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>> at
>> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>> at
>> org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>> at
>> org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>> at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
>> at
>> org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>> at
>> org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.compile(CodeGenerator.scala:362)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:139)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:37)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:425)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:422)
>> at
>> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:294)
>> at org.apache.spark.sql.execution.TungstenSort.org
>> $apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131)
>> at
>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>> at
>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>> at
>> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:59)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.r

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd,

Did your job actually *fail* or did it just generate many spurious
exceptions? While the stacktrace that you posted does indicate a bug, I
don't think that it should have stopped query execution because Spark
should have fallen back to an interpreted code path (note the "Failed to
generate ordering, fallback to interpreted" in the error message).

On Tue, Oct 27, 2015 at 12:56 PM Sjoerd Mulder 
wrote:

> I have disabled it because of it started generating ERROR's when upgrading
> from Spark 1.4 to 1.5.1
>
> 2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to
> generate ordering, fallback to interpreted
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to
> compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9:
> Invalid character input "@" (character code 64)
>
> public SpecificOrdering
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>   return new SpecificOrdering(expr);
> }
>
> class SpecificOrdering extends
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
>
>   private org.apache.spark.sql.catalyst.expressions.Expression[]
> expressions;
>
>
>
>   public
> SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[]
> expr) {
> expressions = expr;
>
>   }
>
>   @Override
>   public int compare(InternalRow a, InternalRow b) {
> InternalRow i = null;  // Holds current row being evaluated.
>
> i = a;
> boolean isNullA2;
> long primitiveA3;
> {
>   /* input[2, LongType] */
>
>   boolean isNull0 = i.isNullAt(2);
>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>
>   isNullA2 = isNull0;
>   primitiveA3 = primitive1;
> }
> i = b;
> boolean isNullB4;
> long primitiveB5;
> {
>   /* input[2, LongType] */
>
>   boolean isNull0 = i.isNullAt(2);
>   long primitive1 = isNull0 ? -1L : (i.getLong(2));
>
>   isNullB4 = isNull0;
>   primitiveB5 = primitive1;
> }
> if (isNullA2 && isNullB4) {
>   // Nothing
> } else if (isNullA2) {
>   return 1;
> } else if (isNullB4) {
>   return -1;
> } else {
>   int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 <
> primitiveB5 ? -1 : 0);
>   if (comp != 0) {
> return -comp;
>   }
> }
>
> return 0;
>   }
> }
>
> at
> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> at
> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
> at
> org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> at
> org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
> at
> org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
> at
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
> at
> org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at
> org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
> at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
> at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at
> org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.compile(CodeGenerator.scala:362)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:139)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:37)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:425)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:422)
> at
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:294)
> at org.apache.spark.sql.execution.TungstenSort.org
> $apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131)
> at
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:59)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at o

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Sjoerd Mulder
I have disabled it because it started generating ERRORs when upgrading from
Spark 1.4 to 1.5.1

2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to
generate ordering, fallback to interpreted
java.util.concurrent.ExecutionException: java.lang.Exception: failed to
compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9:
Invalid character input "@" (character code 64)

public SpecificOrdering
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificOrdering(expr);
}

class SpecificOrdering extends
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {

  private org.apache.spark.sql.catalyst.expressions.Expression[]
expressions;



  public
SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[]
expr) {
expressions = expr;

  }

  @Override
  public int compare(InternalRow a, InternalRow b) {
InternalRow i = null;  // Holds current row being evaluated.

i = a;
boolean isNullA2;
long primitiveA3;
{
  /* input[2, LongType] */

  boolean isNull0 = i.isNullAt(2);
  long primitive1 = isNull0 ? -1L : (i.getLong(2));

  isNullA2 = isNull0;
  primitiveA3 = primitive1;
}
i = b;
boolean isNullB4;
long primitiveB5;
{
  /* input[2, LongType] */

  boolean isNull0 = i.isNullAt(2);
  long primitive1 = isNull0 ? -1L : (i.getLong(2));

  isNullB4 = isNull0;
  primitiveB5 = primitive1;
}
if (isNullA2 && isNullB4) {
  // Nothing
} else if (isNullA2) {
  return 1;
} else if (isNullB4) {
  return -1;
} else {
  int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5
? -1 : 0);
  if (comp != 0) {
return -comp;
  }
}

return 0;
  }
}

at
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at
org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at
org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at
org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at
org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at
org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at
org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at
org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.compile(CodeGenerator.scala:362)
at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:139)
at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:37)
at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:425)
at
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:422)
at org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:294)
at org.apache.spark.sql.execution.TungstenSort.org
$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131)
at
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at
org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:59)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


2015-10-14 21:00 GMT+02:00 Reynold Xin :

> Can you reply to this email and provide us with reasons why you disable it?
>
> Thanks.
>
>


Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
So I tried @Reynold's suggestion. I could get countDistinct and sumDistinct
running, but mean and approxCountDistinct do not work (I guess I am using the
wrong syntax for approxCountDistinct). For mean, I think the registry entry is
missing. Can someone clarify that as well?

On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani 
wrote:

> Will try in a while when I get back. I assume this applies to all
> functions other than mean. Also countDistinct is defined along with all
> other SQL functions. So I don't get "distinct is not part of function name"
> part.
> On 27 Oct 2015 19:58, "Reynold Xin"  wrote:
>
>> Try
>>
>> count(distinct columnane)
>>
>> In SQL distinct is not part of the function name.
>>
>> On Tuesday, October 27, 2015, Shagun Sodhani 
>> wrote:
>>
>>> Oops seems I made a mistake. The error message is : Exception in thread
>>> "main" org.apache.spark.sql.AnalysisException: undefined function
>>> countDistinct
>>> On 27 Oct 2015 15:49, "Shagun Sodhani"  wrote:
>>>
 Hi! I was trying out some aggregate  functions in SparkSql and I
 noticed that certain aggregate operators are not working. This includes:

 approxCountDistinct
 countDistinct
 mean
 sumDistinct

 For example using countDistinct results in an error saying
 *Exception in thread "main" org.apache.spark.sql.AnalysisException:
 undefined function cosh;*

 I had a similar issue with cosh operator
 
 as well some time back and it turned out that it was not registered in the
 registry:
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala


 *I* *think it is the same issue again and would be glad to send over a
 PR if someone can confirm if this is an actual bug and not some mistake on
 my part.*


 Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
 Spark Version: 10.4
 SparkSql Version: 1.5.1

 I am using the standard example of (name, age) schema (though I am
 setting age as Double and not Int as I am trying out maths functions).

 The entire error stack can be found here 
 .

 Thanks!

>>>


Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Will try in a while when I get back. I assume this applies to all functions
other than mean. Also, countDistinct is defined along with all the other SQL
functions, so I don't get the "distinct is not part of the function name"
part.
On 27 Oct 2015 19:58, "Reynold Xin"  wrote:

> Try
>
> count(distinct columnane)
>
> In SQL distinct is not part of the function name.
>
> On Tuesday, October 27, 2015, Shagun Sodhani 
> wrote:
>
>> Oops seems I made a mistake. The error message is : Exception in thread
>> "main" org.apache.spark.sql.AnalysisException: undefined function
>> countDistinct
>> On 27 Oct 2015 15:49, "Shagun Sodhani"  wrote:
>>
>>> Hi! I was trying out some aggregate  functions in SparkSql and I noticed
>>> that certain aggregate operators are not working. This includes:
>>>
>>> approxCountDistinct
>>> countDistinct
>>> mean
>>> sumDistinct
>>>
>>> For example using countDistinct results in an error saying
>>> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> undefined function cosh;*
>>>
>>> I had a similar issue with cosh operator
>>> 
>>> as well some time back and it turned out that it was not registered in the
>>> registry:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>
>>>
>>> *I* *think it is the same issue again and would be glad to send over a
>>> PR if someone can confirm if this is an actual bug and not some mistake on
>>> my part.*
>>>
>>>
>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>>> Spark Version: 10.4
>>> SparkSql Version: 1.5.1
>>>
>>> I am using the standard example of (name, age) schema (though I am
>>> setting age as Double and not Int as I am trying out maths functions).
>>>
>>> The entire error stack can be found here .
>>>
>>> Thanks!
>>>
>>


Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try

count(distinct columnane)

In SQL distinct is not part of the function name.

On Tuesday, October 27, 2015, Shagun Sodhani 
wrote:

> Oops seems I made a mistake. The error message is : Exception in thread
> "main" org.apache.spark.sql.AnalysisException: undefined function
> countDistinct
> On 27 Oct 2015 15:49, "Shagun Sodhani"  > wrote:
>
>> Hi! I was trying out some aggregate  functions in SparkSql and I noticed
>> that certain aggregate operators are not working. This includes:
>>
>> approxCountDistinct
>> countDistinct
>> mean
>> sumDistinct
>>
>> For example using countDistinct results in an error saying
>> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> undefined function cosh;*
>>
>> I had a similar issue with cosh operator
>> 
>> as well some time back and it turned out that it was not registered in the
>> registry:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>
>>
>> *I* *think it is the same issue again and would be glad to send over a
>> PR if someone can confirm if this is an actual bug and not some mistake on
>> my part.*
>>
>>
>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>> Spark Version: 10.4
>> SparkSql Version: 1.5.1
>>
>> I am using the standard example of (name, age) schema (though I am
>> setting age as Double and not Int as I am trying out maths functions).
>>
>> The entire error stack can be found here .
>>
>> Thanks!
>>
>


Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Oops, it seems I made a mistake. The error message is: Exception in thread
"main" org.apache.spark.sql.AnalysisException: undefined function
countDistinct
On 27 Oct 2015 15:49, "Shagun Sodhani"  wrote:

> Hi! I was trying out some aggregate  functions in SparkSql and I noticed
> that certain aggregate operators are not working. This includes:
>
> approxCountDistinct
> countDistinct
> mean
> sumDistinct
>
> For example using countDistinct results in an error saying
> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
> undefined function cosh;*
>
> I had a similar issue with cosh operator
> 
> as well some time back and it turned out that it was not registered in the
> registry:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>
>
> *I* *think it is the same issue again and would be glad to send over a PR
> if someone can confirm if this is an actual bug and not some mistake on my
> part.*
>
>
> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
> Spark Version: 10.4
> SparkSql Version: 1.5.1
>
> I am using the standard example of (name, age) schema (though I am setting
> age as Double and not Int as I am trying out maths functions).
>
> The entire error stack can be found here .
>
> Thanks!
>


Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Hi! I was trying out some aggregate  functions in SparkSql and I noticed
that certain aggregate operators are not working. This includes:

approxCountDistinct
countDistinct
mean
sumDistinct

For example using countDistinct results in an error saying
*Exception in thread "main" org.apache.spark.sql.AnalysisException:
undefined function cosh;*

I had a similar issue with cosh operator

as well some time back and it turned out that it was not registered in the
registry:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala


*I* *think it is the same issue again and would be glad to send over a PR
if someone can confirm if this is an actual bug and not some mistake on my
part.*


Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
Spark Version: 10.4
SparkSql Version: 1.5.1

I am using the standard example of (name, age) schema (though I am setting
age as Double and not Int as I am trying out maths functions).

The entire error stack can be found here .

Thanks!


Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Reynold Xin
Yup looks like I missed that. I will build a new one.

On Tuesday, October 27, 2015, Sean Owen  wrote:

> Ah, good point. I also see it still reads 1.5.1. I imagine we just need
> another sweep to update all the version strings.
>
> On Tue, Oct 27, 2015 at 3:08 AM, Krishna Sankar  > wrote:
>
>> Guys,
>>The sc.version returns 1.5.1 in python and scala. Is anyone getting
>> the same results ? Probably I am doing something wrong.
>> Cheers
>> 
>>
>> On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin > > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The release fixes 51 known issues in Spark 1.5.1, listed here:
>>> http://s.apache.org/spark-1.5.2
>>>
>>> The tag to be voted on is v1.5.2-rc1:
>>> https://github.com/apache/spark/releases/tag/v1.5.2-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> *http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc1-bin/
>>> *
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> - as version 1.5.2-rc1:
>>> https://repository.apache.org/content/repositories/orgapachespark-1151
>>> - as version 1.5.2:
>>> https://repository.apache.org/content/repositories/orgapachespark-1150
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-v1.5.2-rc1-docs/
>>>
>>>
>>> ===
>>> How can I help test this release?
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> 
>>> What justifies a -1 vote for this release?
>>> 
>>> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
>>> present in 1.5.1 will not block this release.
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 1.5.2?
>>> ===
>>> Please target 1.5.3 or 1.6.0.
>>>
>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Sean Owen
Ah, good point. I also see it still reads 1.5.1. I imagine we just need
another sweep to update all the version strings.

On Tue, Oct 27, 2015 at 3:08 AM, Krishna Sankar  wrote:

> Guys,
>The sc.version returns 1.5.1 in python and scala. Is anyone getting the
> same results ? Probably I am doing something wrong.
> Cheers
> 
>
> On Sun, Oct 25, 2015 at 12:07 AM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The release fixes 51 known issues in Spark 1.5.1, listed here:
>> http://s.apache.org/spark-1.5.2
>>
>> The tag to be voted on is v1.5.2-rc1:
>> https://github.com/apache/spark/releases/tag/v1.5.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> *http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc1-bin/
>> *
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> - as version 1.5.2-rc1:
>> https://repository.apache.org/content/repositories/orgapachespark-1151
>> - as version 1.5.2:
>> https://repository.apache.org/content/repositories/orgapachespark-1150
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-v1.5.2-rc1-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> -1 vote should occur for regressions from Spark 1.5.1. Bugs already
>> present in 1.5.1 will not block this release.
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.2?
>> ===
>> Please target 1.5.3 or 1.6.0.
>>
>>
>>
>


Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua,

For categorical features, the ordinal issue can be solved by trying
all kind of different partitions 2^(q-1) -1 for q values into two
groups. However, it's computational expensive. In Hastie's book, in
9.2.4, the trees can be trained by sorting the residuals and being
learnt as if they are ordered. It can be proven that it will give the
optimal solution. I have a proof that this works for learning
regression trees through variance reduction.
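
As a toy illustration of that sorting trick (this is not code from
SparkXGBoost or Spark ML; names and numbers are made up), ordering the q
categories by their mean residual reduces the 2^(q-1) - 1 candidate partitions
to q - 1 splits along the sorted order:

object CategoricalSplit {
  /** residualsByCategory: category id -> residuals of the instances in that category. */
  def candidateSplits(residualsByCategory: Map[Int, Seq[Double]]): Seq[(Set[Int], Set[Int])] = {
    val ordered = residualsByCategory.toSeq
      .map { case (cat, rs) => (cat, rs.sum / rs.size) } // mean residual per category
      .sortBy(_._2)                                      // treat categories as if ordinal
      .map(_._1)
    // q - 1 splits along the sorted order instead of 2^(q-1) - 1 subsets.
    (1 until ordered.size).map { i =>
      (ordered.take(i).toSet, ordered.drop(i).toSet)
    }
  }

  def main(args: Array[String]): Unit = {
    val residuals = Map(
      0 -> Seq(1.0, 0.8),
      1 -> Seq(-0.5, -0.7),
      2 -> Seq(0.1, 0.0))
    candidateSplits(residuals).foreach(println)
  }
}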

I'm also interested in understanding how the L1 and L2 regularization
within the boosting works (and if it helps with overfitting more than
shrinkage).

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu  wrote:
> Hi DB Tsai,
>
> Thank you very much for your interest and comment.
>
> 1) feature sub-sample is per-node, like random forest.
>
> 2) The current code heavily exploits the tree structure to speed up
> the learning (such as processing multiple learning node in one pass of
> the training data). So a generic GBM is likely to be a different
> codebase. Do you have any nice reference of efficient GBM? I am more
> than happy to look into that.
>
> 3) The algorithm accept training data as a DataFrame with the
> featureCol indexed by VectorIndexer. You can specify which variable is
> categorical in the VectorIndexer. Please note that currently all
> categorical variables are treated as ordered. If you want some
> categorical variables as unordered, you can pass the data through
> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> unordered categorical variable using the approach in RF in Spark ML
> (Please see roadmap in the README.md)
>
> Thanks,
>
> Meihua
>
>
>
> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>> you think you can implement generic GBM and have it merged as part of
>> Spark codebase?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>  wrote:
>>> Hi Spark User/Dev,
>>>
>>> Inspired by the success of XGBoost, I have created a Spark package for
>>> gradient boosting tree with 2nd order approximation of arbitrary
>>> user-defined loss functions.
>>>
>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>
>>> Currently linear (normal) regression, binary classification, Poisson
>>> regression are supported. You can extend with other loss function as
>>> well.
>>>
>>> L1, L2, bagging, feature sub-sampling are also employed to avoid 
>>> overfitting.
>>>
>>> Thank you for testing. I am looking forward to your comments and
>>> suggestions. Bugs or improvements can be reported through GitHub.
>>>
>>> Many thanks!
>>>
>>> Meihua
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org