org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:87)
> at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1060)
> at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:142)
> at org.apache.spark.sql.catalyst.expres
Hi Janardhan,
Maybe try removing the string "test" from this line in your build.sbt?
IIRC, the "test" scope restricts the models JAR to the test classpath, so it
is only visible when running tests.
"edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" % "test" classifier "models",
-sujit
On Sun, Sep 18, 2016 at 11:01 AM, janardhan shetty
I built this recently using the accepted answer on this SO page:
http://stackoverflow.com/questions/26741714/how-does-the-pyspark-mappartitions-function-work/26745371
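In Scala terms, the pattern from that answer is roughly this (a toy sketch, the numbers are made up):
val rdd = sc.parallelize(1 to 10, 2)
// mapPartitions gets an Iterator over each partition and must return an Iterator
val sums = rdd.mapPartitions(iter => Iterator(iter.sum))
// sums.collect() => Array(15, 40)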
-sujit
On Sat, May 14, 2016 at 7:00 AM, Mathieu Longtin
wrote:
> From memory:
> def
Hi Charles,
I tried this with dummied-out functions that just sum transformations of a
list of integers; maybe they could be replaced by the algorithms in your case.
The idea is to call them through a "god" function that takes an additional
type parameter and delegates out to the appropriate
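A rough sketch of the idea (the function names and the String tag standing in for the type parameter are purely illustrative):
// two dummy "algorithms" that just sum transformations of a list of ints
def algoA(xs: List[Int]): Int = xs.map(_ * 2).sum
def algoB(xs: List[Int]): Int = xs.map(x => x * x).sum

// "god" function that takes an extra tag and delegates to the right algorithm
def god(tag: String, xs: List[Int]): Int = tag match {
  case "A" => algoA(xs)
  case "B" => algoB(xs)
}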
Hi Ningjun,
Haven't done this myself, saw your question and was curious about the
answer and found this article which you might find useful:
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
According to this article, you can pass in your SQL statement
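A hedged sketch of what that might look like (assumes Spark 1.4+'s DataFrameReader; the URL, driver and query are placeholders, not taken from the article):
val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://dbhost:5432/mydb",              // placeholder connection string
  "driver" -> "org.postgresql.Driver",                        // placeholder driver class
  "dbtable" -> "(select id, title from docs) as docs_subset"  // your SQL statement goes here
)).load()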
6 AM, Sean Owen <so...@cloudera.com> wrote:
> Not sure who generally handles that, but I just made the edit.
>
> On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
> > Sorry to be a nag, I realize folks with edit rights on the Powered by
> Spar
, Content and Event Analytics, Content/Event based Predictive Models
and Big Data Processing. We use Scala and Python over Databricks Notebooks
for most of our work.
Thanks very much,
Sujit
On Fri, Nov 13, 2015 at 9:21 AM, Sujit Pal <sujitatgt...@gmail.com> wrote:
> Hello,
>
> We
Graphs, Content as a
Service, Content and Event Analytics, Content/Event based Predictive Models
and Big Data Processing. We use Scala and Python over Databricks Notebooks
for most of our work.
Thanks very much,
Sujit Pal
Technical Research Director
Elsevier Labs
sujit@elsevier.com
Hi Alexander,
You may want to try the wholeTextFiles() method of SparkContext. Using that
you could just do something like this:
sc.wholeTextFiles("hdfs://input_dir")
  .saveAsSequenceFile("hdfs://output_dir")
wholeTextFiles returns an RDD of (filename, content) pairs.
Hi Bin,
Very likely the RedisClientPool is being closed too quickly before map has
a chance to get to it. One way to verify would be to comment out the .close
line and see what happens. FWIW I saw a similar problem writing to Solr
where I put a commit where you have a close, and noticed that the
Hi Zhiliang,
For a system of equations AX = y, Linear Regression will give you a
best-fit estimate for A (coefficient vector) for a matrix of feature
variables X and corresponding target variable y for a subset of your data.
OTOH, what you are looking for here is to solve for x a system of
Hi Sebastian,
You can save models to disk and load them back up. In the snippet below
(copied out of a working Databricks notebook), I train a model, then save
it to disk, then retrieve it back into model2 from disk.
import org.apache.spark.mllib.tree.RandomForest
> import
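A minimal sketch of the save/load round trip (paths and training data are placeholders; assumes MLlib's RandomForestModel.save/load, available since Spark 1.3):
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

// trainingData: RDD[LabeledPoint], prepared elsewhere
val model = RandomForest.trainClassifier(trainingData, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 10,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = 32)

model.save(sc, "/tmp/rf-model")                           // save to disk
val model2 = RandomForestModel.load(sc, "/tmp/rf-model")  // read it back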
Hi Zhiliang,
How about doing something like this?
val rdd3 = rdd1.zip(rdd2).map(p =>
p._1.zip(p._2).map(z => z._1 - z._2))
The first zip will join the two RDDs and produce an RDD of (Array[Float],
Array[Float]) pairs. On each pair, we zip the two Array[Float] components
together to form an
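A fuller sketch with made-up data:
val rdd1 = sc.parallelize(Seq(Array(1.0f, 2.0f), Array(3.0f, 4.0f)))
val rdd2 = sc.parallelize(Seq(Array(0.5f, 1.0f), Array(1.5f, 2.0f)))
// zip the RDDs pairwise, then zip each pair of arrays and subtract elementwise
val rdd3 = rdd1.zip(rdd2).map(p => p._1.zip(p._2).map(z => z._1 - z._2))
// rdd3.collect() => Array(Array(0.5, 1.0), Array(1.5, 2.0))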
Hi Tapan,
Perhaps this may work? It takes a range of 0..100 and creates an RDD out of
them, then calls X(i) on each. The X(i) should be executed on the workers
in parallel.
Scala:
val results = sc.parallelize(0 until 100).map(idx => X(idx))
Python:
results =
Hi Zhiliang,
Would something like this work?
val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0))
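One assumption worth calling out: sliding on an RDD comes from MLlib's RDDFunctions, so it needs an import. A fuller sketch:
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd1 = sc.parallelize(Seq(1.0, 3.0, 6.0, 10.0))
// sliding(2) yields windows (1,3), (3,6), (6,10); the map takes successive differences
val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0))
// rdd2.collect() => Array(2.0, 3.0, 4.0)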
-sujit
On Mon, Sep 21, 2015 at 7:58 AM, Zhiliang Zhu
wrote:
> Hi Romi,
>
> Thanks very much for your kind help comment~~
>
> In fact there is some valid background of
> It seems to be OK, however, do you know the corresponding spark Java API
> achievement...
> Is there any java API as scala sliding, and it seemed that I do not find
> spark scala's doc about sliding ...
>
> Thank you very much~
> Zhiliang
>
>
>
> On Monday, September 2
Hi Saif,
Would this work?
import scala.collection.JavaConversions._
new java.math.BigDecimal(5) match { case x: java.math.BigDecimal => x.doubleValue }
It gives me this on the Scala console:
res9: Double = 5.0
Assuming you had a stream of BigDecimals, you could just call map on it.
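Something along these lines, as a toy example (the data is made up):
val decimals = sc.parallelize(Seq(new java.math.BigDecimal(5), new java.math.BigDecimal(7)))
val doubles = decimals.map(_.doubleValue)   // RDD[Double]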
on Spark, but wanted to throw in my spin [based on my own understanding].
Nothing official about it :)
-abhishek-
On Jul 31, 2015, at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote:
Hello,
I am trying to run a Spark job that hits an external webservice to get
back some information. The cluster is 1
in advance for any help you can provide.
-sujit
On Fri, Jul 31, 2015 at 1:03 PM, Sujit Pal sujitatgt...@gmail.com wrote:
Hello,
I am trying to run a Spark job that hits an external webservice to get
back some information. The cluster is 1 master + 4 workers, each worker has
60GB RAM and 4
AM, Igor Berman igor.ber...@gmail.com wrote:
What kind of cluster? How many cores on each worker? Is there config for
http solr client? I remember standard httpclient has limit per route/host.
On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:
No one has any ideas?
Is there some
Hello,
I am trying to run a Spark job that hits an external webservice to get back
some information. The cluster is 1 master + 4 workers, each worker has 60GB
RAM and 4 CPUs. The external webservice is a standalone Solr server, and is
accessed using code similar to that shown below.
def
Hi Schmirr,
The part after the s3n:// is your bucket name and folder name, i.e.
s3n://${bucket_name}/${folder_name}[/${subfolder_name}]*. Bucket names are
unique across S3, so the resulting path is also unique. There is no concept
of hostname in S3 URLs as far as I know.
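For example (the bucket and folder names here are made up):
val lines = sc.textFile("s3n://my-company-data/events/2015/07/")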
-sujit
On Fri, Jul 17,
to
provide the keys?
Thank you,
*From:* Sujit Pal [mailto:sujitatgt...@gmail.com]
*Sent:* Tuesday, July 14, 2015 3:14 PM
*To:* Pagliari, Roberto
*Cc:* user@spark.apache.org
*Subject:* Re: Spark on EMR with S3 example (Python)
Hi Roberto,
I have written PySpark code that reads
Hi Wush,
One option may be to try a replicated join. Since your rdd1 is small, read
it into a collection and broadcast it to the workers, then filter your
larger rdd2 against the collection on the workers.
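A rough sketch of that (assumes both RDDs are keyed pair RDDs; names are illustrative):
// rdd1: RDD[(K, V)] is small, rdd2: RDD[(K, W)] is large
val smallKeys = sc.broadcast(rdd1.keys.collect().toSet)
val filtered = rdd2.filter { case (k, _) => smallKeys.value.contains(k) }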
-sujit
On Tue, Jul 14, 2015 at 11:33 PM, Deepak Jain deepuj...@gmail.com wrote:
Hi Roberto,
I have written PySpark code that reads from private S3 buckets, it should
be similar for public S3 buckets as well. You need to set the AWS access
and secret keys into the SparkContext, then you can access the S3 folders
and files with their s3n:// paths. Something like this:
sc =
Department of Information Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia
On Fri, Jul 10, 2015 at 11:48 AM, Sujit Pal sujitatgt...@gmail.com
wrote:
Hi Ashish,
Julian's approach is probably better, but a few observations:
1) Your SPARK_HOME should be C:\spark-1.3.0
AM, Sujit Pal sujitatgt...@gmail.com wrote:
Hi Ashish,
Nice post.
Agreed, kudos to the author of the post, Benjamin Benfort of District
Labs.
Following your post, I get this problem;
Again, not my post.
I did try setting up IPython with the Spark profile for the edX Intro to
Spark
Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia
On Fri, Jul 10, 2015 at 12:02 AM, Sujit Pal sujitatgt...@gmail.com
wrote:
Hi Ashish,
Your 00-pyspark-setup file looks very different from mine (and from the
one described in the blog post). Questions:
1) Do you have
Hi Julian,
I recently built a Python+Spark application to do search relevance
analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
EC2 (so I don't use the PySpark shell, hopefully that's what you are looking
for). Can't share the code, but the basic approach is covered in
, HADOOP_HOME=D:\WINUTILS, M2_HOME=D:\MAVEN\BIN,
MAVEN_HOME=D:\MAVEN\BIN, PYTHON_HOME=C:\PYTHON27\, SBT_HOME=C:\SBT\
Sincerely,
Ashish Dutt
PhD Candidate
Department of Information Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia
On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal
!
On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal sujitatgt...@gmail.com wrote:
Hi Julian,
I recently built a Python+Spark application to do search relevance
analytics. I use spark-submit to submit PySpark jobs to a Spark cluster
on
EC2 (so I don't use the PySpark shell, hopefully that's what you
Hi Rex,
If the CSV files are in the same folder and there are no other files,
specifying the directory to sc.textFile() (or equivalent) will pull in all
the files. If there are other files, you can pass in a pattern that would
capture the two files you care about (if that's possible). If neither
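For the first two cases, something like this (paths and the glob pattern are made up):
val all = sc.textFile("hdfs://data/csv_dir/")                 // whole directory
val some = sc.textFile("hdfs://data/csv_dir/file_{1,2}.csv")  // just the two files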
Hi Rexx,
In general (i.e. not Spark specific), it's best to convert categorical data to
one-hot encoding rather than integers - that way the algorithm doesn't use
the ordering implicit in the integer representation.
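A toy sketch of the idea (the categories and values are made up):
val categories = Seq("red", "green", "blue")
// one-hot: a 1.0 in the slot for the matching category, 0.0 elsewhere
def oneHot(value: String): Array[Double] =
  categories.map(c => if (c == value) 1.0 else 0.0).toArray
// oneHot("green") => Array(0.0, 1.0, 0.0)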
-sujit
On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote:
Is it
Hi Pierre,
One way is to recreate your credentials until AWS generates one without a
slash character in it. Another way I've been using is to pass these
credentials outside the S3 file path by setting the following (where sc is
the SparkContext).
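Roughly along these lines (the property names assume the s3n filesystem, and accessKey/secretKey are placeholders for your credentials; treat this as a sketch):
// set the credentials on the Hadoop config instead of embedding them in the s3n:// URL
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)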
Hello all,
This is probably me doing something obviously wrong, would really
appreciate some pointers on how to fix this.
I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [
https://spark.apache.org/downloads.html] and just untarred it on a local
drive. I am on Mac OSX
To make this permanent I put this in conf/spark-env.sh.
-sujit
On Sat, May 23, 2015 at 8:14 AM, Sujit Pal sujitatgt...@gmail.com wrote:
Hello all,
This is probably me doing something obviously wrong, would really
appreciate some pointers on how to fix this.
I installed spark-1.3.1-bin-hadoop2.6.tgz from