Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread DB Tsai
yarn.populateHadoopClasspath" is used in YARN mode correct? > However our Spark cluster is standalone cluster not using YARN. > We only connect to HDFS/Hive to access data.Computation is done on our spark > cluster running on K8s (not Yarn) > > > On Mon, Jul 20, 2020 at 2:0

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
e with HDFS/Hive running on Hadoop 2.6 ? > > Best Regards, -- Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 42E5B25A8F7A82C1 - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark 3.0.1 not connecting with Hive 2.1.1

2021-01-09 Thread DB Tsai
Hi Pradyumn, I think it’s because of a HMS client backward compatibility issue described here, https://issues.apache.org/jira/browse/HIVE-24608 Thanks, DB Tsai | ACI Spark Core |  Apple, Inc > On Jan 9, 2021, at 9:53 AM, Pradyumn Agrawal wrote: > > Hi Michael, > Thanks fo

Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread DB Tsai
Thank you, Huaxin, for the 3.2.1 release! Sent from my iPhone > On Jan 28, 2022, at 5:45 PM, Chao Sun wrote: > >  > Thanks Huaxin for driving the release! > >> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng wrote: >> It's Great! >> Congrats and thanks, huaxin! >> >> >> -- Original Mail

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread DB Tsai
Hi Xin, If you take a look at the model you trained, the intercept from Spark is significantly smaller than StatsModel, and the intercept represents a prior on categories in LOR which causes the low accuracy in Spark implementation. In LogisticRegressionWithLBFGS, the intercept is regularized due

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread DB Tsai
Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Fri, May 22, 2015 at 10:45 AM, Xin Liu wrote: > Thank you guys for the prompt help. > > I ended up building spark master and verified what DB has suggested. > >

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread DB Tsai
In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML pipeline framework. Model selection can be achieved through a high lambda, which results in many zeros in the coefficients. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On
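A minimal sketch of the approach described above, using the ML pipeline's LogisticRegression. The `regParam` value and the `trainingDF` DataFrame (with the standard "label"/"features" columns) are illustrative assumptions, and `coefficients` is the newer accessor name (`weights` in 1.4-era code):

```scala
import org.apache.spark.ml.classification.LogisticRegression

// L1 (elastic-net with alpha = 1.0) drives uninformative coefficients to
// exactly zero, acting as implicit feature selection.
val lor = new LogisticRegression()
  .setElasticNetParam(1.0) // 1.0 = pure L1 (lasso); 0.0 = pure L2 (ridge)
  .setRegParam(0.1)        // higher lambda => sparser coefficient vector
val model = lor.fit(trainingDF)

// Indices of the features that survived the L1 penalty:
val selected = model.coefficients.toArray.zipWithIndex
  .collect { case (w, i) if w != 0.0 => i }
```

Increasing `regParam` shrinks the set of surviving features; a small grid of lambda values is a common way to pick the model size.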

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
If with mesos, how do we control the number of executors? In our cluster, each node only has one executor with very big JVM. Sometimes, if the executor dies, all the concurrent running tasks will be gone. We would like to have multiple executors in one node but can not figure out a way to do it in

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
Typo: we cannot figure out a way to increase the number of executors in one node in Mesos. On Wednesday, May 27, 2015, DB Tsai wrote: > If with mesos, how do we control the number of executors? In our cluster, > each node only has one executor with very big JVM. Sometimes, if the > exec

Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread DB Tsai
result from R. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, May 27, 2015 at 9:08 PM, Maheshakya Wijewardena wrote: > > Hi, > > I'm trying to use Sparks' LinearRegressionWithSGD in PySpark with the > atta

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-30 Thread DB Tsai
>> https://issues.apache.org/jira/browse/SPARK-7674 >> >> To answer your question: "How are the weights calculated: is there a >> correlation calculation with the variable of interest?" >> --> Weights are calculated as with all logistic regression algorit

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle. I guess you don't have enough partitions, so please repartition your input dataset to a number at least larger than the # of executors you have. In Spark 1.4's new ML pipeline a
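A sketch of the repartitioning advice above, assuming an existing SparkContext `sc`, an input `vectors: RDD[Vector]`, and placeholder cluster sizes:

```scala
import org.apache.spark.mllib.feature.StandardScaler

// Placeholders: set these to your actual cluster dimensions.
val numExecutors = 10
val coresPerExecutor = 4

// Give every executor core work to do before fitting the scaler.
val repartitioned = vectors.repartition(numExecutors * coresPerExecutor)

val scaler = new StandardScaler(withMean = true, withStd = true)
val scalerModel = scaler.fit(repartitioned)        // small shuffle to compute stats
val scaled = scalerModel.transform(repartitioned)  // per-partition, no shuffle
```

If transform is the slow part, too few partitions (or skewed ones) is the usual culprit, since it is purely a map-side operation.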

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
excuse any typos. > > On Jun 3, 2015, at 9:53 PM, DB Tsai > wrote: > > Which part of StandardScaler is slow? Fit or transform? Fit has shuffle > but very small, and transform doesn't do shuffle. I guess you don't have > enough partition, so please repartition y

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
By default, the depth of the tree is 2. Each partition will be one node. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar wrote: > Hey Reza, > > Thanks for your response! > >
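The default depth mentioned above can be sketched as follows, assuming an existing SparkContext `sc`:

```scala
// Each partition is reduced locally into one leaf node; partial results are
// then combined in a two-level tree on the executors instead of being pulled
// straight to the driver.
val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
val sum = rdd.treeReduce(_ + _, depth = 2) // depth = 2 is the default
```

A larger `depth` adds more intermediate combine rounds, which helps when there are many partitions or large partial results.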

Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
> vary at each level? > > Thanks! > > > On Thursday, June 4, 2015, DB Tsai wrote: >> >> By default, the depth of the tree is 2. Each partition will be one node. >> >> Sincerely, >> >> DB Tsai >>

Re: Linear Regression with SGD

2015-06-09 Thread DB Tsai
As Robin suggested, you may try the following new implementation. https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef Thanks. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <ht

Re: Implementing top() using treeReduce()

2015-06-09 Thread DB Tsai
queue2 } }.toArray.sorted(ord) } } } def treeTop(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { treeTakeOrdered(num)(ord.reverse) } Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xA

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
We have tests to verify that the results match R. > > @Naveen: Please feel free to add/comment on the above points as you see > necessary. > > Thanks, > Sauptik. > > -Original Message- > From: DB Tsai > Sent: Tuesday, June 16, 2015 2:08 PM > To: Ramakrishnan

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar, For "standardization", we can disable it effectively by using different regularization on each component. Thus, we're solving the same problem but with a better rate of convergence. This is one of the features I will implement. Sinc

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
You need to build the spark assembly with your modification and deploy into cluster. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Jun 17, 2015 at 5:11 PM, Raghav Shankar wrote: > I’ve implemented t

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
all of them. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Wed, Jun 17, 2015 at 5:15 PM, Raghav Shankar wrote: > So, I would add the assembly jar to the just the master or would I have to > add it

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-19 Thread DB Tsai
ith scalability. Here is the talk I gave in Spark summit about the new elastic-net feature in ML. I will encourage you to try the one ML. http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit Sinc

Re: Missing values support in Mllib yet?

2015-06-19 Thread DB Tsai
Not really yet. But at work, we do GBDT missing-value imputation, so I'm interested in porting it to mllib if I have enough time. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Jun 19, 2015 at 1:

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
o you don't see it explicitly, but the code is in line 128. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Jun 23, 2015 at 3:14 PM, Wei Zhou wrote: > Hi DB Tsai, > > Thanks for your reply.

Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
Please see the current version of code for better documentation. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP

Re: FW: MLLIB (Spark) Question.

2015-07-08 Thread DB Tsai
Hi Dhar, Disabling `standardization` feature is just merged in master. https://github.com/apache/spark/commit/57221934e0376e5bb8421dc35d4bf91db4deeca1 Let us know your feedback. Thanks. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com

Re: Incomplete data when reading from S3

2016-03-19 Thread DB Tsai
You need to use wholeTextFiles to read each whole file at once. Otherwise, it can be split. DB Tsai - Sent From My Phone On Mar 17, 2016 12:45 AM, "Blaž Šnuderl" wrote: > Hi. > > We have json data stored in S3 (json record per line). When reading the > data from s3 using
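A sketch of the difference, assuming an existing SparkContext `sc` and a placeholder S3 path:

```scala
// sc.textFile splits large objects into line-based records across partitions,
// while sc.wholeTextFiles yields one (path, content) pair per file, so each
// file is read in full by a single task.
val perLine = sc.textFile("s3a://bucket/prefix/")       // a file may be split
val perFile = sc.wholeTextFiles("s3a://bucket/prefix/") // (filename, wholeContent)

// Recover line-per-record semantics after the whole-file read:
val records = perFile.flatMap { case (_, content) => content.split("\n") }
```

Note that `wholeTextFiles` materializes each file in memory on one executor, so it is only suitable when individual files are reasonably small.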

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread DB Tsai
+1 for renaming the jar file. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly wrote: > perhaps renaming to Spark ML would actually clear up code and documentat

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread DB Tsai
Try running the following to see the actual ulimit. We found that mesos overrides the ulimit, which causes the issue. import sys.process._ val p = 1 to 100 val rdd = sc.parallelize(p, 100) val a = rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong
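A complete version of the check described above (the archived snippet is cut off), assuming an existing SparkContext `sc`:

```scala
import sys.process._

// Run `ulimit -n` inside every task to see the file-descriptor limit each
// executor actually inherits (e.g. under Mesos, which may override it).
val rdd = sc.parallelize(1 to 100, 100)
val openFileLimits =
  rdd.map(_ => Seq("sh", "-c", "ulimit -n").!!.trim.toDouble.toLong).collect()

// If any executor reports a low limit, that's the source of the
// "Too many open files" failures during shuffle.
openFileLimits.distinct.foreach(println)
```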

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread DB Tsai
try to refactor those code to share more.) Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D> On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu wrote: >

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
LinearRegressionWithSGD is not stable. Please use linear regression in ML package instead. http://spark.apache.org/docs/latest/ml-linear-methods.html Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Oct 25

Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
Column 4 is always constant, so it has no predictive power, resulting in a zero weight. On Sunday, October 25, 2015, Zhiliang Zhu wrote: > Hi DB Tsai, > > Thanks very much for your kind reply help. > > As for your comment, I just modified and tested the key part of the codes: > > Line

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do you think you can implement generic GBM and have it merged as part of Spark codebase? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon

Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical feature? Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai wrote: > Interesting. For feature sub-sampling, is it per-node or per-tree?

Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
tting more than shrinkage). Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu wrote: > Hi DB Tsai, > > Thank you very much for your interest and comment. &

Re: [Spark MLlib] about linear regression issue

2015-11-01 Thread DB Tsai
n to our current linear regression, but currently, there is no open source implementation in Spark. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Nov 1, 2015 at 9:22 AM, Zhiliang Zhu wrote: > Dear All, >

Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread DB Tsai
Do you think it will be useful to separate those models and model loader/writer code into another spark-ml-common jar without any spark platform dependencies so users can load the models trained by Spark ML in their application and run the prediction? Sincerely, DB Tsai

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
e it back to open source community, we need to address this. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen wrote: > This is all starting to sound a lot like what'

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
This will bring in the whole dependency set of Spark, which may break the web app. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando wrote: > > > On Fri, Nov 13,

Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread DB Tsai
to be small enough to return the result to users within reasonable latency, so I doubt the usefulness of distributed models in real production use-cases. For R and Python, we can build a wrapper on top of the lightweight "spark-ml-common" project. Sinc

Re: Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread DB Tsai
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Tue, Nov 17, 2015 at 4:11 PM, n

Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread DB Tsai
This is tricky. You need to shuffle the ending and beginning elements using mapPartitionsWithIndex. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu wrote: > Hi
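As a simpler (if more shuffle-heavy) alternative to the partition-boundary trick described above, adjacent elements can be paired by indexing the RDD and joining each element with its successor. Assuming an existing `rdd: RDD[Int]`:

```scala
// Give every element a global index, then join index i with index i+1.
val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
val shifted = indexed.map { case (i, v) => (i - 1, v) }

// (elem_i, elem_{i+1}) for every adjacent pair in RDD order.
val adjacent = indexed.join(shifted).values
```

This shuffles the whole dataset; the mapPartitionsWithIndex approach only moves the first/last element of each partition, which is why it scales better.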

Re: the way to compare any two adjacent elements in one rdd

2015-12-06 Thread DB Tsai
Only the beginning and ending parts of the data. The rest within a partition can be compared without a shuffle. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Dec 6, 2015 at 6:27 PM, Zhiliang Zhu wrote: > > &g

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Could you paste some of your code for diagnosis? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev

Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
openFiles = rdd.map(x=> Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect Hope this can help someone in the same situation. Sincerely, DB Tsai -- Blog: ht

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Your code looks correct for me. How many # of features do you have in this training? How many tasks are running in the job? Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D <https://pgp.mit.edu/pks/lookup?sea

Re: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
in ./apps/mesos-0.22.1/sbin/mesos-daemon.sh #!/usr/bin/env bash prefix=/apps/mesos-0.22.1 exec_prefix=/apps/mesos-0.22.1 deploy_dir=${prefix}/etc/mesos # Increase the default number of open file descriptors. ulimit -n 8192 Sincerely, DB Tsai

Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
You want to reduce the # of partitions to around the # of executors * cores. Since you have so many tasks/partitions which will give a lot of pressure on treeReduce in LoR. Let me know if this helps. Sincerely, DB Tsai -- Blog: https
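A sketch of the partition-count advice above, with placeholder cluster sizes and an assumed `trainingData: RDD[LabeledPoint]`:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Placeholders: set to your actual cluster dimensions.
val numExecutors = 20
val coresPerExecutor = 4

// coalesce shrinks the partition count without a full shuffle, cutting the
// number of partial gradients that treeReduce has to merge each iteration.
val compact = trainingData.coalesce(numExecutors * coresPerExecutor).cache()

val model = new LogisticRegressionWithLBFGS().run(compact)
```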

Re: a question about LBFGS in Spark

2016-08-24 Thread DB Tsai
otal is actually the regularization part of gradient. // Will add the gradientSum computed from the data with weights in the next step. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D >> On Wed, Aug 24, 2016 at

Re: SPARK ML- Feature Selection Techniques

2016-09-05 Thread DB Tsai
You can try LOR with L1. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Mon, Sep 5, 2016 at 5:31 AM, Bahubali Jain wrote: > Hi, > Do we have any feature selection techniques implementation(wrapper >

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread DB Tsai
Hi Jong, I think the definition from Kaggle is correct. I'm working on implementing ranking metrics in Spark ML now, but the timeline is unknown. Feel free to submit a PR for this in MLlib. Thanks. Sincerely, DB Tsai -- Web:

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-05 Thread DB Tsai
With the latest code in the current master, we're successfully training LOR using Spark ML's implementation with 14M sparse features. You need to tune the depth of aggregation to make it efficient. Sincerely, DB Tsai --
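The "depth of aggregation" tuning mentioned above is exposed in newer Spark ML versions as `setAggregationDepth`. A sketch, assuming a `wideSparseDF` DataFrame with "label"/"features" columns and an illustrative depth:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// For very wide models (e.g. millions of sparse features), a deeper
// treeAggregate merges partial gradients in more executor-side rounds,
// so the driver only ever receives a few pre-combined vectors.
val lor = new LogisticRegression()
  .setAggregationDepth(3) // default is 2; increase for very wide models
val model = lor.fit(wideSparseDF)
```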

Re: Why dataframe can be more efficient than dataset?

2017-04-13 Thread DB Tsai
There is a JIRA and prototype which analyzes the JVM bytecode in the black box, and convert the closures into catalyst expressions. https://issues.apache.org/jira/browse/SPARK-14083 This potentially can address the issue discussed here. Sincerely, DB Tsai

Re: imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread DB Tsai
We have the weighting algorithms implemented in linear models, but unfortunately, they are not implemented in tree models. It's an important feature, and PRs are welcome! Thanks. Sincerely, DB Tsai -- Web: https://www.dbtsai.com

[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-04 Thread DB Tsai
+user list We are happy to announce the availability of Spark 2.4.1! Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4.0 users to upgrade to this stable release. In Apache Spark 2.4.1, Scala 2.12 support is GA, and it'

Re: Release Apache Spark 2.4.4

2019-08-13 Thread DB Tsai
+1 On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun wrote: > > Hi, All. > > Spark 2.4.3 was released three months ago (8th May). > As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` > since 2.4.3. > > It would be great if we can have Spark 2.4.4. > Shall we start `2.4.4 RC1`

Re: JDK11 Support in Apache Spark

2019-08-24 Thread DB Tsai
Congratulations on the great work! Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 42E5B25A8F7A82C1 On Sat, Aug 24, 2019 at 8:11 AM Dongjoon Hyun wrote: > > Hi, All. > > Thanks to your many many contributions, &g

Re: foreachActive functionality

2015-01-25 Thread DB Tsai
PS, we were using Breeze's activeIterator originally, as you can see in the old code, but we found there was overhead there, so we wrote our own implementation, which is 4x faster. See https://github.com/apache/spark/pull/3288 for details. Sincerely, DB

Re: LBGFS optimizer performace

2015-03-05 Thread DB Tsai
PS, I recommend compressing the data when you cache the RDD. There will be some overhead in compression/decompression and serialization/deserialization, but it helps a lot for iterative algorithms by allowing more data to be cached. Sincerely, DB Tsai

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
hanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Fri, Mar 13, 2015 at 2:41 PM, cjwang wrote: > I am running LogisticRegressionWithLBFGS. I got these lines on my console: > > 2015-03-12 17:3

Re: How to deploy binary dependencies to workers?

2015-03-24 Thread DB Tsai
I would recommend uploading those jars to HDFS and using the add-jars option in spark-submit with an HDFS URI instead of a local filesystem URI. This avoids fetching the jars from the driver, which can be a bottleneck. Sincerely, DB Tsai

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
cause problem for the algorithm. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc. wrote: > Hello, > > I am new to spark streaming API. > > I wanted to ask if I can ap

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread DB Tsai
Are you deploying the windows dll to linux machine? Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen wrote: > I think you meant to use the "--files" to deploy the DLLs. I gave a try,

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-25 Thread DB Tsai
We fixed a couple of issues in the Breeze LBFGS implementation. Can you try Spark 1.3 and see if they still exist? Thanks. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Mon, Mar 16, 2015 at 12:48 PM, Chang-Jia Wang wrote: > I just u

Re: Features scaling

2015-04-21 Thread DB Tsai
Hi Denys, I don't see any issue in your python code, so maybe there is a bug in the python wrapper. If it's in scala, I think it should work. BTW, LogisticRegressionWithLBFGS does the standardization internally, so you don't need to do it yourself. It's worth giving it a try! Sinc

Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread DB Tsai
andles the scaling and intercepts implicitly in objective function so no overhead of creating new transformed dataset. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Wed, Apr 29, 2015 at 1:21 AM, selim namsi wrote: > Thank you fo

Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in the MLlib package supports multilabel classification. Sincerely, DB Tsai --- Blog: https://www.dbtsai.com On Tue, May 5, 2015 at 1:13 PM, peterg wrote: > Hi all, > > I'm looking to implement a Multilabel

Re: Logistic Regression MLLib Slow

2014-06-04 Thread DB Tsai
/latest/mllib-optimization.html for detail. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng wrote: > Hi Krishna, > > Specifying executor

Re: Logistic Regression MLLib Slow

2014-06-04 Thread DB Tsai
Hi Krishna, It should work, and we use it in production with great success. However, the constructor of LogisticRegressionModel is private[mllib], so you have to write your code, and have the package name under org.apache.spark.mllib instead of using scala console. Sincerely, DB Tsai

Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan, You can check out the unittest code of GradientDescent.runMiniBatchSGD https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala Sincerely, DB Tsai --- My Blog

Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
ty UI tracker for each operation will be very expensive. Is there a way to disable this behavior? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
What if there are multiple threads using the same spark context? Will each thread have its own UI? In that case, it will quickly run out of ports. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread DB Tsai
Hi Nick, How does reduce work? I thought after reducing in the executor, it will reduce in parallel between multiple executors instead of pulling everything to driver and reducing there. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Normalizations in MLBase

2014-06-11 Thread DB Tsai
Hi Aslan, Currently, we don't have a utility function to do so. However, you can easily implement this with another map transformation. I'm working on this feature now, and there will be a couple of different normalization options users can choose from. Sincerely

Re: Using Spark to crack passwords

2014-06-11 Thread DB Tsai
in RDD will be the same as the # of executors, and we can use mapPartition to loop through all the sample in the range without actually storing them in RDD. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com

Re: Normalizations in MLBase

2014-06-12 Thread DB Tsai
if you find something wrong. > > BR, > Aslan > > > > On Thu, Jun 12, 2014 at 11:13 AM, Aslan Bekirov > wrote: >> >> Thanks a lot DB. >> >> I will try to do Znorm normalization using map transformation. >> >> >> BR, >> Aslan

Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread DB Tsai
Hi Congrui, Since it's private to the mllib package, one workaround is to write your code in a Scala file under the mllib package in order to use the constructor of LogisticRegressionModel. Sincerely, DB Tsai --- My Blog: https://www.dbtsa

Re: pyspark regression results way off

2014-06-16 Thread DB Tsai
Is your data normalized? Sometimes, GD doesn't work well if the data has a wide range. If you are willing to write scala code, you can try the LBFGS optimizer, which converges better than GD. Sincerely, DB Tsai --- My Blog: https://www.dbtsa

Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread DB Tsai
Hi Congrui, We're working on weighted regularization, so for intercept, you can just set it as 0. It's also useful when the data is normalized but want to solve the regularization with original data. Sincerely, DB Tsai --- My B

Re: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread DB Tsai
Hi Congrui, I mean create your own TrainMLOR.scala with all the code provided in the example, and have it under "package org.apache.spark.mllib" Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linke

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, What's different between treeAggregate and aggregate? Why treeAggregate scales better? What if we just use mapPartition, will it be as fast as treeAggregate? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com Lin
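The contrast asked about above can be sketched as follows, assuming an existing SparkContext `sc` (the sizes are illustrative):

```scala
// aggregate merges all partition results on the driver in one step, so the
// driver does O(#partitions) combine work. treeAggregate first merges partial
// results on the executors in log-depth rounds, so the driver only receives
// a handful of pre-combined values.
val nums = sc.parallelize(1 to 1000, numSlices = 200)

val flat = nums.aggregate(0L)(_ + _, _ + _)                 // driver-heavy merge
val tree = nums.treeAggregate(0L)(_ + _, _ + _, depth = 2)  // tree-shaped merge
```

For small combined values (like a sum) the difference is negligible; it matters when each partial result is large, e.g. a dense gradient vector per partition.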

Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui, Does it mean that mapPartition and then reduce shares the same behavior as aggregate operation which is O(n)? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Tue, Jun 17

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
Memory").getOrElse("1g"), "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"), "--executor-cores", conf.get("spark.workerCores").getOrElse("1")) } System.setPrope

Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
trace, etc. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers wrote: > db tsai, > if in yarn-cluster mode the driver runs inside yarn, how c

Re: parallel Reduce within a key

2014-06-20 Thread DB Tsai
/1110 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Jun 20, 2014 at 6:57 AM, ansriniv wrote: > Hi, > > I am on Spark 0.9.0 > > I have a 2 node cluster (2 worker nodes) w

Re: pyspark regression results way off

2014-06-25 Thread DB Tsai
There is no python binding for LBFGS. Feel free to submit a PR. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi wrote: > Is a python binding

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-05 Thread DB Tsai
You may try LBFGS to have more stable convergence. In spark 1.1, we will be able to use LBFGS instead of GD in training process. On Jul 4, 2014 1:23 PM, "Thomas Robert" wrote: > Hi all, > > I too am having some issues with *RegressionWithSGD algorithms. > > Concerning your issue Eustache, this co

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the mode that requires installing the jar on each individual node is standalone mode, which works for both MR1 and MR2. Cloudera and Hortonworks currently support spark in this way as far as I know. For both yarn-cluster and yarn-client, Spark will distribute the jars through the distributed cache and

Re: usage question for saprk run on YARN

2014-07-07 Thread DB Tsai
spark-client mode runs the driver in your application's JVM, while spark-cluster mode runs the driver in the yarn cluster. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jul 7, 2014 at 5:

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
may not be straightforward by just changing the version in spark build script. Jetty 9.x required Java 7 since the servlet api (servlet 3.1) requires Java 7. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linke

Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from latest development branch from git repository. On Jul 9, 2014 9:45 AM, "AlexanderRiggers" wrote: > By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by > master? Because I am using v 1.0.0 - Alex > > > > -- > View this message in context: > http://

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
Are you using 1.0 or current master? A bug related to this is fixed in master. On Jul 12, 2014 8:50 AM, "Srikrishna S" wrote: > I am run logistic regression with SGD on a problem with about 19M > parameters (the kdda dataset from the libsvm library) > > I consistently see that the nodes on my com

Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
https://issues.apache.org/jira/browse/SPARK-2156 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S wrote: > I am using the master that I compi

Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread DB Tsai
done, and sparse data is supported. It will be interesting to see new benchmark result. Anyone familiar with BIDMach? Are they as fast as they claim? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-27 Thread DB Tsai
Could you help to provide a test case to verify this issue and open a JIRA to track this? Also, are you interested in submit a PR to fix it? Thanks. Sent from my Google Nexus 5 On Jul 27, 2014 11:07 AM, "Aureliano Buendia" wrote: > Hi, > > The recently added NNLS implementation in MLlib returns

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-02 Thread DB Tsai
I ran into this issue as well. The workaround by copying jar and ivy manually suggested by Shivaram works for me. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 1, 2014 at 3:31

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread DB Tsai
You can try to define a wrapper class for your parser and create an instance of your parser in its companion object as a singleton. Thus, even if you create a wrapper object in mapPartitions every time, each JVM will have only a single instance of your parser object. Sincerely, DB Tsai
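A sketch of the pattern described above. `Parser` is a stand-in for your non-serializable class, and `rdd` is assumed to be an `RDD[String]`:

```scala
// The companion object's lazy val is initialized once per JVM (i.e. once per
// executor), so every task on that executor shares the same parser instance.
class ParserWrapper extends Serializable {
  def parse(line: String): String = ParserWrapper.parser.parse(line)
}

object ParserWrapper {
  // Hypothetical non-serializable parser; never shipped over the wire,
  // constructed lazily on each executor instead.
  lazy val parser: Parser = new Parser()
}

val parsed = rdd.mapPartitions { it =>
  val w = new ParserWrapper // cheap; the real parser lives in the singleton
  it.map(w.parse)
}
```

Only the lightweight wrapper is serialized with the closure; the expensive, non-serializable object is built on the executor side.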

Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-10 Thread DB Tsai
Spark caches the RDD in the JVM, so presumably, yes, the singleton trick should work. Sent from my Google Nexus 5 On Aug 9, 2014 11:00 AM, "Kevin James Matzen" wrote: > I have a related question. With Hadoop, I would do the same thing for > non-serializable objects and setup(). I also had a use ca

Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
s here and there, so we're looking forward to your feedback, and please let us know what you think. We'll continue to improve it and we'll be adding Gradient Boosting in the near future as well. Thanks. Sincerely, DB Tsai --- My Blo

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread DB Tsai
me columns. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Aug 11, 2014 at 2:21 PM, Burak Yavuz wrote: > Hi, > > // Initialize the optimizer using logistic regression as the loss
