Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-19 Thread Sujit Pal
org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:87) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1060) at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142) at org.apache.spark.sql.catalyst.expres

Re: Lemmatization using StanfordNLP in ML 2.0

2016-09-18 Thread Sujit Pal
Hi Janardhan, Maybe try removing the string "test" from this line in your build.sbt? IIRC, this restricts the models JAR to the test classpath, so it can't be seen from your main code. "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models", -sujit On Sun, Sep 18, 2016 at 11:01 AM, janardhan shetty
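A minimal build.sbt sketch of the suggested fix (the version and artifact names are taken from the quoted line; the surrounding Seq is illustrative):

    libraryDependencies ++= Seq(
      // main CoreNLP artifact
      "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0",
      // models JAR: no "test" scope, so it is visible to main (non-test) code
      "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models"
    )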

Re: pyspark mappartions ()

2016-05-14 Thread Sujit Pal
I built this recently using the accepted answer on this SO page: http://stackoverflow.com/questions/26741714/how-does-the-pyspark-mappartitions-function-work/26745371 -sujit On Sat, May 14, 2016 at 7:00 AM, Mathieu Longtin wrote: > From memory: > def

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread Sujit Pal
Hi Charles, I tried this with dummied-out functions that just compute sums of transformations over a list of integers; they could be replaced by your algorithms. The idea is to call them through a "god" function that takes an additional type parameter and delegates out to the appropriate
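A rough Scala sketch of that dispatcher idea (the algorithm names, data, and the run() helper are made up for illustration):

    // dummy "algorithms" that each reduce a list of integers to a number
    def sumOfSquares(xs: Seq[Int]): Long = xs.map(x => x.toLong * x).sum
    def sumOfCubes(xs: Seq[Int]): Long = xs.map(x => x.toLong * x * x).sum

    // "god" function: dispatches on a type tag to the appropriate algorithm
    def run(algo: String, xs: Seq[Int]): Long = algo match {
      case "squares" => sumOfSquares(xs)
      case "cubes"   => sumOfCubes(xs)
    }

    // distribute the algorithm tags; each task runs one algorithm over the same broadcast data
    val data = sc.broadcast((1 to 1000).toSeq)
    val results = sc.parallelize(Seq("squares", "cubes"))
      .map(algo => (algo, run(algo, data.value)))
      .collect()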

Re: How to create dataframe from SQL Server SQL query

2015-12-07 Thread Sujit Pal
Hi Ningjun, Haven't done this myself, but I saw your question, was curious about the answer, and found this article which you might find useful: http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/ According to this article, you can pass in your SQL statement
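The usual way to do this with the JDBC data source is to alias the query as the dbtable option; a Scala sketch with placeholder connection details (not taken from the article):

    // hypothetical URL, query, and driver: adjust to your SQL Server setup
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:sqlserver://dbhost:1433;databaseName=mydb;user=me;password=secret",
      "dbtable" -> "(SELECT id, name FROM customers WHERE region = 'US') AS custs",
      "driver"  -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    )).load()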

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Sujit Pal
6 AM, Sean Owen <so...@cloudera.com> wrote: > Not sure who generally handles that, but I just made the edit. > > On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal <sujitatgt...@gmail.com> wrote: > > Sorry to be a nag, I realize folks with edit rights on the Powered by > Spar

Re: Please add us to the Powered by Spark page

2015-11-23 Thread Sujit Pal
, Content and Event Analytics, Content/Event based Predictive Models and Big Data Processing. We use Scala and Python over Databricks Notebooks for most of our work. Thanks very much, Sujit On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal <sujitatgt...@gmail.com> wrote: > Hello, > > We

Please add us to the Powered by Spark page

2015-11-13 Thread Sujit Pal
Graphs, Content as a Service, Content and Event Analytics, Content/Event based Predictive Models and Big Data Processing. We use Scala and Python over Databricks Notebooks for most of our work. Thanks very much, Sujit Pal Technical Research Director Elsevier Labs sujit@elsevier.com

Re: Prevent possible out of memory when using read/union

2015-11-04 Thread Sujit Pal
Hi Alexander, You may want to try the wholeTextFiles() method of SparkContext. Using that you could just do something like this: sc.wholeTextFiles("hdfs://input_dir").saveAsSequenceFile("hdfs://output_dir") wholeTextFiles() returns an RDD of (filename, content) pairs.

Re: How to close connection in mapPartitions?

2015-10-23 Thread Sujit Pal
Hi Bin, Very likely the RedisClientPool is being closed too early, before map has had a chance to use it. One way to verify would be to comment out the .close line and see what happens. FWIW I saw a similar problem writing to Solr where I put a commit where you have a close, and noticed that the
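A generic Scala sketch of the usual per-partition pattern (SomeClient and its methods are placeholders, not the Redis client from the thread): open the connection inside mapPartitions, force the iterator before closing, then return the materialized results.

    rdd.mapPartitions { iter =>
      val client = new SomeClient()                            // one connection per partition
      val results = iter.map(rec => client.send(rec)).toList   // force evaluation before closing
      client.close()                                           // safe: the iterator is fully consumed
      results.iterator
    }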

Re: How to get inverse Matrix / RDD or how to solve linear system of equations

2015-10-23 Thread Sujit Pal
Hi Zhiliang, For a system of equations AX = y, Linear Regression will give you a best-fit estimate for A (coefficient vector) for a matrix of feature variables X and corresponding target variable y for a subset of your data. OTOH, what you are looking for here is to solve for x a system of
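If the system is small enough to fit on the driver, one option (outside Spark) is a direct dense solve, e.g. with Breeze; a sketch with made-up values:

    import breeze.linalg.{DenseMatrix, DenseVector}

    // solve A * x = y for x
    val A = DenseMatrix((2.0, 1.0), (1.0, 3.0))
    val y = DenseVector(3.0, 5.0)
    val x = A \ y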

Re: Save RandomForest Model from ML package

2015-10-22 Thread Sujit Pal
Hi Sebastian, You can save models to disk and load them back up. In the snippet below (copied out of a working Databricks notebook), I train a model, then save it to disk, then retrieve it back into model2 from disk. import org.apache.spark.mllib.tree.RandomForest > import
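The round trip looks roughly like this (a sketch with a placeholder path; trainingData is assumed to be an existing RDD[LabeledPoint]; save/load for RandomForestModel is available from Spark 1.3):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // train
    val model = RandomForest.trainClassifier(trainingData, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
      featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = 32)

    // save to disk, then load it back into a second variable
    model.save(sc, "/tmp/rf-model")
    val model2 = RandomForestModel.load(sc, "/tmp/rf-model")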

Re: How to subtract two RDDs with same size

2015-09-23 Thread Sujit Pal
Hi Zhiliang, How about doing something like this? val rdd3 = rdd1.zip(rdd2).map(p => p._1.zip(p._2).map(z => z._1 - z._2)) The first zip will join the two RDDs and produce an RDD of (Array[Float], Array[Float]) pairs. On each pair, we zip the two Array[Float] components together to form an

Re: Calling a method parallel

2015-09-23 Thread Sujit Pal
Hi Tapan, Perhaps this may work? It takes the range 0 until 100, creates an RDD out of it, then calls X(i) on each element. The X(i) calls should be executed on the workers in parallel. Scala: val results = sc.parallelize(0 until 100).map(idx => X(idx)) Python: results =

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
Hi Zhiliang, Would something like this work? val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0)) -sujit On Mon, Sep 21, 2015 at 7:58 AM, Zhiliang Zhu wrote: > Hi Romi, > > Thanks very much for your kind help and comment~~ > > In fact there is some valid background of
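Note that sliding() is not part of the core RDD API; it comes from MLlib's RDDFunctions, so the full incantation would look something like this (assuming rdd1 is an RDD of a numeric type such as Double):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // each window is Array(x(i), x(i+1)); subtract to get adjacent differences
    val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0))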

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
> It seems to be OK, however, do you know the corresponding spark Java API > achievement... > Is there any java API as scala sliding, and it seemed that I do not find > spark scala's doc about sliding ... > > Thank you very much~ > Zhiliang > > > > On Monday, September 2

Re: Scala: How to match a java object????

2015-08-18 Thread Sujit Pal
Hi Saif, Would this work? import scala.collection.JavaConversions._ new java.math.BigDecimal(5) match { case x: java.math.BigDecimal => x.doubleValue } It gives me this on the Scala console: res9: Double = 5.0 Assuming you had a stream of BigDecimals, you could just call map on it.

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Sujit Pal
on Spark, but wanted to throw in my spin based on my own understanding]. Nothing official about it :) -abhishek- On Jul 31, 2015, at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote: Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
in advance for any help you can provide. -sujit On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote: Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers, each worker has 60GB RAM and 4

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Sujit Pal
AM, Igor Berman igor.ber...@gmail.com wrote: What kind of cluster? How many cores on each worker? Is there config for http solr client? I remember standard httpclient has limit per route/host. On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote: No one has any ideas? Is there some

How to increase parallelism of a Spark cluster?

2015-07-31 Thread Sujit Pal
Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers, each worker has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server, and is accessed using code similar to that shown below. def
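The code itself is truncated in this archive snippet, but the general shape of such a job (a hypothetical sketch, not the actual code from the thread) is an RDD of queries spread over enough partitions to keep every core busy, with the webservice call applied per element:

    // 4 workers x 4 cores = 16 partitions, so each core gets a share of the calls
    val numPartitions = 4 * 4
    val results = queries.repartition(numPartitions)
      .map(q => querySolr(q))   // querySolr is a placeholder for the HTTP/Solr call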

Re: use S3-Compatible Storage with spark

2015-07-17 Thread Sujit Pal
Hi Schmirr, The part after the s3n:// is your bucket name and folder name, i.e. s3n://${bucket_name}/${folder_name}[/${subfolder_name}]*. Bucket names are unique across S3, so the resulting path is also unique. There is no concept of hostname in S3 URLs as far as I know. -sujit On Fri, Jul 17,
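For example, with a made-up bucket and folder, such a path is read the usual way:

    // hypothetical bucket and folder names, just to show the shape of the path
    val lines = sc.textFile("s3n://my-company-data/logs/2015/07/")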

Re: Spark on EMR with S3 example (Python)

2015-07-15 Thread Sujit Pal
to provide the keys? Thank you, From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Tuesday, July 14, 2015 3:14 PM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: Spark on EMR with S3 example (Python) Hi Roberto, I have written PySpark code that reads

Re: Efficiency of leftOuterJoin a cassandra rdd

2015-07-15 Thread Sujit Pal
Hi Wush, One option may be to try a replicated join. Since your rdd1 is small, read it into a collection and broadcast it to the workers, then filter your larger rdd2 against the collection on the workers. -sujit On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain deepuj...@gmail.com wrote:
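A small Scala sketch of that map-side ("replicated") join, assuming both RDDs are keyed pair RDDs (the names are placeholders):

    // collect the small RDD's keys on the driver and broadcast them to the workers
    val smallKeys = sc.broadcast(rdd1.map(_._1).collect().toSet)

    // "join" by filtering the large RDD against the broadcast key set, worker-side
    val joined = rdd2.filter { case (k, _) => smallKeys.value.contains(k) }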

Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Sujit Pal
Hi Roberto, I have written PySpark code that reads from private S3 buckets; it should be similar for public S3 buckets as well. You need to set the AWS access and secret keys into the SparkContext, then you can access the S3 folders and files with their s3n:// paths. Something like this: sc =

Re: PySpark without PySpark

2015-07-10 Thread Sujit Pal
Department of Information Systems University of Malaya, Lembah Pantai, 50603 Kuala Lumpur, Malaysia On Fri, Jul 10, 2015 at 11:48 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hi Ashish, Julian's approach is probably better, but a few observations: 1) Your SPARK_HOME should be C:\spark-1.3.0

Re: PySpark without PySpark

2015-07-09 Thread Sujit Pal
AM, Sujit Pal sujitatgt...@gmail.com wrote: Hi Ashish, Nice post. Agreed, kudos to the author of the post, Benjamin Benfort of District Labs. Following your post, I get this problem; Again, not my post. I did try setting up IPython with the Spark profile for the edX Intro to Spark

Re: PySpark without PySpark

2015-07-09 Thread Sujit Pal
Systems University of Malaya, Lembah Pantai, 50603 Kuala Lumpur, Malaysia On Fri, Jul 10, 2015 at 12:02 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hi Ashish, Your 00-pyspark-setup file looks very different from mine (and from the one described in the blog post). Questions: 1) Do you have

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the PySpark shell, hopefully that's what you are looking for). Can't share the code, but the basic approach is covered in

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
, HADOOP_HOME=D:\WINUTILS, M2_HOME=D:\MAVEN\BIN, MAVEN_HOME=D:\MAVEN\BIN, PYTHON_HOME=C:\PYTHON27\, SBT_HOME=C:\SBT\ Sincerely, Ashish Dutt PhD Candidate Department of Information Systems University of Malaya, Lembah Pantai, 50603 Kuala Lumpur, Malaysia On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal

Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
! On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hi Julian, I recently built a Python+Spark application to do search relevance analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on EC2 (so I don't use the PySpark shell, hopefully thats what you

Re: HOw to concatenate two csv files into one RDD?

2015-06-26 Thread Sujit Pal
Hi Rex, If the CSV files are in the same folder and there are no other files, specifying the directory to sc.textFile() (or equivalent) will pull in all the files. If there are other files, you can pass in a pattern that would capture the two files you care about (if that's possible). If neither
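A couple of Scala sketches of what that looks like (the file and directory names are made up); sc.textFile() accepts a directory, a comma-separated list of paths, or a glob:

    // whole directory
    val all = sc.textFile("hdfs:///data/csvs/")

    // just the two files, as a comma-separated list or a glob
    val two  = sc.textFile("hdfs:///data/csvs/a.csv,hdfs:///data/csvs/b.csv")
    val two2 = sc.textFile("hdfs:///data/csvs/{a,b}.csv")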

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Sujit Pal
Hi Rexx, In general (i.e. not Spark specific), it's best to convert categorical data to 1-hot encoding rather than integers - that way the algorithm doesn't use the ordering implicit in the integer representation. -sujit On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote: Is it
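A toy illustration of what 1-hot encoding means here (plain Scala, not tied to any Spark API; the color vocabulary is made up):

    // "green" becomes Array(0.0, 1.0, 0.0) instead of the integer 1, so no ordering is implied
    val colors = Seq("red", "green", "blue")
    def oneHot(value: String): Array[Double] =
      colors.map(c => if (c == value) 1.0 else 0.0).toArray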

Re: Access several s3 buckets, with credentials containing /

2015-06-06 Thread Sujit Pal
Hi Pierre, One way is to recreate your credentials until AWS generates a secret key without a slash character in it. Another way I've been using is to pass these credentials outside the S3 file path by setting the following (where sc is the SparkContext).
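The settings themselves are cut off in this snippet, but they are presumably the Hadoop S3 credential properties; a Scala sketch with placeholder keys:

    // credentials go into the Hadoop configuration instead of being embedded in the s3n:// URL
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    val rdd = sc.textFile("s3n://some-bucket/some/folder/")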

Not able to run SparkPi locally

2015-05-23 Thread Sujit Pal
Hello all, This is probably me doing something obviously wrong; I would really appreciate some pointers on how to fix this. I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [https://spark.apache.org/downloads.html] and just untarred it on a local drive. I am on Mac OSX

Re: Not able to run SparkPi locally

2015-05-23 Thread Sujit Pal
this permanent I put this in conf/spark-env.sh. -sujit On Sat, May 23, 2015 at 8:14 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hello all, This is probably me doing something obviously wrong, would really appreciate some pointers on how to fix this. I installed spark-1.3.1-bin-hadoop2.6.tgz from