NaiveBayes for MLPipeline is absent

2015-06-18 Thread Justin Yip
Hello, Currently, there is no NaiveBayes implementation for MLPipeline. I couldn't find a JIRA ticket related to it either (or maybe I missed it). Is there a plan to implement it? If no one has the bandwidth, I can work on it. Thanks. Justin

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Justin Yip
Done. https://issues.apache.org/jira/browse/SPARK-8420 Justin On Wed, Jun 17, 2015 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote: That sounds like a bug. Could you create a JIRA and ping Yin Huai (cc'ed). -Xiangrui On Wed, May 27, 2015 at 12:57 AM, Justin Yip yipjus...@prediction.io

NullPointerException with functions.rand()

2015-06-10 Thread Justin Yip
Hello, I am using 1.4.0 and found the following weird behavior. This case works fine: scala> sc.parallelize(Seq((1, 2), (3, 100))).toDF.withColumn("index", rand(30)).show() +--+---+---+ |_1| _2| index| +--+---+---+ | 1| 2| 0.6662967911724369| |
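The working case from the message above can be sketched as follows (a minimal sketch assuming a live SparkContext `sc` and a SQLContext `sqlContext`, Spark 1.4):

```scala
import org.apache.spark.sql.functions.rand
import sqlContext.implicits._  // enables .toDF on RDDs of products

// functions.rand(seed) appends a column of pseudo-random doubles;
// a fixed seed makes the column reproducible across runs.
val df = sc.parallelize(Seq((1, 2), (3, 100))).toDF("_1", "_2")
  .withColumn("index", rand(30))
df.show()
```

The failing case in the full thread involves applying `rand` after other transformations; the sketch only reproduces the case the author reports as working.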

Why the default Params.copy doesn't work for Model.copy?

2015-06-04 Thread Justin Yip
Hello, I have a question about the Spark 1.4 ml library. In the copy function, it is stated that the default implementation of Params doesn't work for models. ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Model.scala#L49 ) As a result, some

Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-05-27 Thread Justin Yip
Hello, I am trying out 1.4.0 and noticed some differences in Timestamp behavior between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with a string. scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01
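A sketch of the comparison in question, with an explicit cast that sidesteps the 1.3.1-vs-1.4.0 implicit-coercion difference (assumes a live `sqlContext`; the column names are illustrative):

```scala
import java.sql.Timestamp
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.TimestampType

val df = sqlContext.createDataFrame(Seq(
  (1, Timestamp.valueOf("2015-01-01 00:00:00")),
  (2, Timestamp.valueOf("2014-01-01 00:00:00")))).toDF("id", "ts")

// Instead of relying on implicit string-to-timestamp comparison
// (whose behavior changed between releases), cast the literal explicitly:
df.filter(df("ts") > lit("2014-06-01 00:00:00").cast(TimestampType)).show()
```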

Building scaladoc using build/sbt unidoc failure

2015-05-26 Thread Justin Yip
Hello, I am trying to build the scaladoc from the 1.4 branch. But it failed due to [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode) I followed the instructions on GitHub

Getting the best parameter set back from CrossValidatorModel

2015-05-16 Thread Justin Yip
Hello, I am using MLPipeline. I would like to extract the best parameters found by CrossValidator. But I cannot find much documentation about how to do it. Can anyone give me some pointers? Thanks. Justin
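One common recipe (a sketch only; `cvModel` is assumed to be the CrossValidatorModel returned by `cv.fit(training)`, and exactly where the tuned values surface varies somewhat by Spark version):

```scala
// The winning model selected by cross-validation:
val best = cvModel.bestModel

// The tuned hyperparameters typically live on the estimator that
// produced the best model, reachable through Model.parent:
println(best.parent.extractParamMap())
```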

Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
Hello, I would like to know if there are recommended ways of preventing ambiguous columns when joining DataFrames. When we join DataFrames, it often happens that we join on columns with identical names. I could rename the columns on the right data frame, as described in the following code. Is

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread Justin Yip
be possible to write UDAFs along similar lines to sum/min etc. On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io wrote: Hello, May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the doc but didn't find any apart from the UDF

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
: DataFrame, usingColumn: String): DataFrame — df.join(df1, "_1") This has the added benefit of only outputting a single _1 column. On Fri, May 15, 2015 at 3:44 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I would like to know if there are recommended ways of preventing ambiguous columns
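Both approaches from the thread, side by side (a sketch assuming two DataFrames `df` and `df1` that share a `_1` column; the `join(right, usingColumn)` overload is the one Spark 1.4 added):

```scala
// Spark 1.4+: join on a shared column name; the key appears once in the output.
val joined = df.join(df1, "_1")

// Pre-1.4 workaround: rename the right-hand key first, then drop it.
val renamed = df1.withColumnRenamed("_1", "rightKey")
val joined2 = df.join(renamed, df("_1") === renamed("rightKey")).drop("rightKey")
```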

Custom Aggregate Function for DataFrame

2015-05-14 Thread Justin Yip
Hello, May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the doc but didn't find any apart from the UDF functions, which apply on a Row. Maybe I have missed something. Thanks. Justin
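At the time of this thread no public UDAF API existed; Spark 1.5 later added `UserDefinedAggregateFunction`. A toy sketch of that later API, computing a sum of squares (the class name and column names are illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Sum of squares over a double column, as a custom aggregate.
object SumOfSquares extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("acc", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val v = input.getDouble(0)
      buffer(0) = buffer.getDouble(0) + v * v
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)

  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// Usage: df.groupBy("key").agg(SumOfSquares(df("value")))
```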

Creating StructType with DataFrame.withColumn

2015-04-30 Thread Justin Yip
Hello, I would like to add a StructType column to a DataFrame. What would be the best way to do it? Not sure if it is possible using withColumn. A possible way is to convert the DataFrame into an RDD[Row], add the struct, and then convert it back to a DataFrame. But that seems like overkill. I guess I may have
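One way to do this without the RDD[Row] round trip is the `struct` function in `org.apache.spark.sql.functions` (available from Spark 1.4; a sketch assuming a DataFrame `df` with columns `a` and `b`):

```scala
import org.apache.spark.sql.functions.struct

// Pack existing columns into a single nested StructType column.
val df2 = df.withColumn("pair", struct(df("a"), df("b")))
df2.printSchema()  // "pair" shows up as struct<a:..., b:...>
```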

Re: casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
After some trial and error, using the DataType solves the problem: df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I was able to cast a timestamp into long using

casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
Hello, I was able to cast a timestamp into long using df.withColumn("millis", $"eventTime".cast("long") * 1000) in Spark 1.3.0. However, this statement fails with Spark 1.3.1. I got the following exception: Exception in thread "main" org.apache.spark.sql.types.DataTypeException: Unsupported
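The workaround the author settled on in the reply above, written out (a sketch assuming a DataFrame `df` with a timestamp column `eventTime` and `import sqlContext.implicits._` in scope): pass the DataType object instead of the string name, which avoids the 1.3.1 string-name parser entirely.

```scala
import org.apache.spark.sql.types.LongType

// cast(LongType) takes the DataType object, so the "Unsupported" string-name
// error from 1.3.1 never triggers; the result is epoch seconds, hence * 1000.
val withMillis = df.withColumn("millis", $"eventTime".cast(LongType) * 1000)
```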

Column renaming after DataFrame.groupBy

2015-04-21 Thread Justin Yip
Hello, I would like to rename a column after aggregation. In the following code, the column name is SUM(_1#179); is there a way to rename it to a more friendly name? scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10))) scala> d.groupBy("_1").sum().printSchema root |-- _1: integer
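Two ways to get a friendly name (a sketch assuming a live `sqlContext`; the exact generated name such as SUM(_2) can vary by version, so naming the aggregate up front is the safer option):

```scala
import org.apache.spark.sql.functions.sum

val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10))).toDF("_1", "_2")

// Option 1: alias the aggregate expression directly.
d.groupBy("_1").agg(sum("_2").as("total")).printSchema()

// Option 2: rename the generated column after the fact.
d.groupBy("_1").sum("_2").withColumnRenamed("SUM(_2)", "total")
```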

Re: Expected behavior for DataFrame.unionAll

2015-04-14 Thread Justin Yip
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L144 On Tue, Apr 7, 2015 at 5:00 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1

Catching executor exception from executor in driver

2015-04-14 Thread Justin Yip
Hello, I would like to know if there is a way of catching an exception thrown from an executor in the driver program. Here is an example: try { val x = sc.parallelize(Seq(1, 2, 3)).map(e => e / 0).collect } catch { case e: SparkException => { println(s"ERROR: $e") println(s"CAUSE:
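A cleaned-up version of the snippet (a sketch assuming a live `sc`): task failures surface in the driver as a SparkException once the job exhausts its retries, and the original executor-side error is often only recoverable from the exception's cause or message text.

```scala
import org.apache.spark.SparkException

try {
  sc.parallelize(Seq(1, 2, 3)).map(e => e / 0).collect()
} catch {
  case e: SparkException =>
    println(s"ERROR: $e")
    // getCause may be null in some versions; the executor-side stack trace
    // may then only appear embedded in the message string.
    println(s"CAUSE: ${e.getCause}")
}
```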

Fwd: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string

Re: How to use Joda Time with Spark SQL?

2015-04-12 Thread Justin Yip
Cheng, this is great info. I have a follow-up question. There are a few very common data types (e.g. Joda DateTime) that are not directly supported by SparkSQL. Do you know if there are any plans to accommodate some common data types in SparkSQL? They don't need to be a first-class datatype, but

The $ notation for DataFrame Column

2015-04-10 Thread Justin Yip
Hello, The DataFrame documentation always uses $"columnX" to denote a column. But I cannot find much information about it. Maybe I have missed something. Can anyone point me to the doc about the $, if there is any? Thanks. Justin
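The `$` syntax is an implicit conversion (StringToColumn) brought in by the SQLContext's implicits import; the three forms below are equivalent ways of referring to a column (a sketch assuming a live `sc` and a DataFrame `df`):

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._  // brings $"..." (StringToColumn) into scope

df.select($"columnX")                                   // $-notation
df.select(df("columnX"))                                // apply on the DataFrame
df.select(org.apache.spark.sql.functions.col("columnX")) // functions.col (1.4+)
```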

DataFrame column name restriction

2015-04-10 Thread Justin Yip
Hello, Are there any restrictions on column names? I tried to use ".", but sqlContext.sql cannot find the column. I would guess that "." is tricky as it affects accessing StructTypes, but are there any more restrictions on column names? scala> case class A(a: Int) defined class A scala>
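A dot in a column name collides with struct-field access, which is why `sqlContext.sql` can't resolve it bare. In SQL, escaping the name with backticks works; whether the DataFrame API accepts the escaped form varies by version (a sketch assuming a live `sqlContext`; the table name `t` is illustrative):

```scala
val df = sqlContext.createDataFrame(Seq(Tuple1(1))).toDF("a.b")
df.registerTempTable("t")

// Backticks tell the SQL parser "a.b" is one column, not struct access.
sqlContext.sql("SELECT `a.b` FROM t").show()
```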

Re: DataFrame groupBy MapType

2015-04-07 Thread Justin Yip
, Michael Armbrust mich...@databricks.com wrote: In HiveQL, you should be able to express this as: SELECT ... FROM table GROUP BY m['SomeKey'] On Sat, Apr 4, 2015 at 5:25 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have a case class like this: case class A( m: Map[Long, Long

Expected behavior for DataFrame.unionAll

2015-04-07 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string
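The behavior at issue: unionAll resolves columns by position, not by name, so the two schemas are matched column-for-column and mismatched types go through implicit type coercion rather than failing outright (a sketch assuming a live `sqlContext`; the exact widening rules live in Catalyst's HiveTypeCoercion):

```scala
case class A(a: Int, b: String)
case class B(a: String, c: Int)

val adf = sqlContext.createDataFrame(Seq(A(1, "x")))
val bdf = sqlContext.createDataFrame(Seq(B("y", 2)))

// Position 1 pairs Int with String, position 2 pairs String with Int;
// both columns end up widened rather than matched by name.
adf.unionAll(bdf).printSchema()
```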

DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Hello, I have a parquet file of around 55M rows (~1G on disk). Performing a simple grouping operation is pretty efficient (I get results within 10 seconds). However, after calling DataFrame.cache, I observe a significant performance degradation; the same operation now takes 3+ minutes. My hunch is

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
The schema has a StructType. Justin On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai yh...@databricks.com wrote: Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote: Hello

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
the improvement. On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip yipjus...@prediction.io wrote: The schema has a StructType. Justin On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai yh...@databricks.com wrote: Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin

DataFrame groupBy MapType

2015-04-04 Thread Justin Yip
Hello, I have a case class like this: case class A( m: Map[Long, Long], ... ) and constructed a DataFrame from Seq[A]. I would like to perform a groupBy on A.m("SomeKey"). I can implement a UDF, create a new Column, then invoke a groupBy on the new Column. But is that the idiomatic way of doing
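Besides the HiveQL form suggested in the reply above (GROUP BY m['SomeKey']), the DataFrame API can group on a map entry directly via `Column.getItem`, without a UDF or a named intermediate column (a sketch assuming a live `sqlContext`; the simplified case class and key are illustrative):

```scala
case class A(m: Map[Long, Long], v: Long)  // simplified: the real class has more fields

val df = sqlContext.createDataFrame(Seq(A(Map(1L -> 10L), 1), A(Map(1L -> 10L), 2)))

// getItem(key) performs the m(key) lookup as a Column expression.
df.groupBy(df("m").getItem(1L)).count().show()
```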

Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou, You can look at the recommendation template http://templates.prediction.io/PredictionIO/template-scala-parallel-recommendation of PredictionIO. PredictionIO is built on top of Spark. And this template illustrates how you can save the ALS model to HDFS and then reload it later.

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2

StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
Hello, I have been using MLlib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and encountered a StackOverflow problem. After some digging, I realized that around iteration ~35 I get the overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there
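The fix Xiangrui points to in the reply above can be sketched as follows (assumes a live `sc` and an RDD[Rating] called `ratings`; the checkpoint path is a hypothetical example): checkpointing truncates the RDD lineage that grows with each ALS iteration and eventually overflows the stack during serialization.

```scala
import org.apache.spark.mllib.recommendation.ALS

// Checkpointing requires a reliable directory (typically on HDFS).
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // hypothetical path

val model = new ALS()
  .setRank(10)
  .setIterations(80)
  .setCheckpointInterval(10)  // available from 1.3.1; cuts the lineage every 10 iterations
  .run(ratings)
```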

Catching InvalidClassException in sc.objectFile

2015-03-19 Thread Justin Yip
Hello, I have persisted an RDD[T] to disk through saveAsObjectFile. Then I changed the implementation of T. When I read the file with sc.objectFile using the new binary, I got a java.io.InvalidClassException, which is expected. I try to catch this error via SparkException in the

Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Hello, From the accumulator documentation, it says that if the accumulator is named, it will be displayed in the WebUI. However, I cannot find it anywhere. Do I need to specify anything in the Spark UI config? Thanks. Justin

Re: Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Found it. Thanks Patrick. Justin On Wed, Jan 14, 2015 at 10:38 PM, Patrick Wendell pwend...@gmail.com wrote: It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote: Hello, From
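For reference, only *named* accumulators appear in the UI, and as Patrick notes, only on the detail page of stages that actually update them (a sketch assuming a live `sc`):

```scala
// The second argument is the display name; unnamed accumulators never
// show up in the WebUI at all.
val acc = sc.accumulator(0, "My Counter")
sc.parallelize(1 to 100).foreach(x => acc += 1)
// After the job runs, the value appears on that stage's page in the UI.
```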

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Justin Yip
Xuelin, There is a function called emptyRDD on SparkContext http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext which serves this purpose. Justin On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'd like to create a
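The type of the empty RDD is fixed by the type parameter (a sketch assuming a live `sc`):

```scala
// An empty RDD with a concrete element type and no partitions.
val empty = sc.emptyRDD[Int]
assert(empty.count() == 0)
```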

MLLib sample data format

2014-06-22 Thread Justin Yip
Hello, I am looking into a couple of MLlib data files in https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any explanation for these files. Does anyone know if they are documented? Thanks. Justin

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
On Sun, Jun 22, 2014 at 3:24 PM, Justin Yip yipjus...@gmail.com wrote: Hi Shuo, Yes. I was reading the guide as well as the sample code. For example, in http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm, nowhere in the GitHub repository I

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
. Justin On Sun, Jun 22, 2014 at 2:40 PM, Shuo Xiang shuoxiang...@gmail.com wrote: Hi, you might find http://spark.apache.org/docs/latest/mllib-guide.html helpful. On Sun, Jun 22, 2014 at 2:35 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I am looking into a couple of MLLib data files

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
I see. That's good. Thanks. Justin On Sun, Jun 22, 2014 at 4:59 PM, Evan Sparks evan.spa...@gmail.com wrote: Oh, and the movie lens one is userid::movieid::rating - Evan On Jun 22, 2014, at 3:35 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I am looking into a couple of MLLib data
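Beyond the MovieLens file Evan describes, most of the classification/regression samples in that directory are in LIBSVM format ("label index1:value1 index2:value2 ..." with 1-based, ascending indices), which MLlib loads directly (a sketch assuming a live `sc` run from the Spark source root):

```scala
import org.apache.spark.mllib.util.MLUtils

// Parses each line into a LabeledPoint with a sparse feature vector.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
data.take(1).foreach(p => println(s"label=${p.label} features=${p.features}"))
```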

MLLib : Decision Tree with minimum points per node

2014-06-13 Thread Justin Yip
Hello, I have been playing around with mllib's decision tree library. It is working great, thanks. I have a question regarding overfitting. It appears to me that the current implementation doesn't allow the user to specify the minimum number of samples per node. This results in some nodes only
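This knob was added in a later release: `Strategy.minInstancesPerNode` (Spark 1.2) stops splitting when a child would receive too few samples. A sketch, assuming an RDD[LabeledPoint] called `trainingData`:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini

// minInstancesPerNode = 10 prunes splits that would create nodes
// with fewer than 10 training samples, curbing overfitting.
val strategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 5,
  numClasses = 2,
  minInstancesPerNode = 10)

val model = DecisionTree.train(trainingData, strategy)
```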

KryoException: Unable to find class

2014-06-05 Thread Justin Yip
Hello, I have been using Externalizer from Chill as a serialization wrapper. It appears to me that Spark has some classloader conflict with Chill. I have the following (simplified) program: import java.io._ import com.twitter.chill.Externalizer class X(val i: Int) {