On 10 Feb 2016, at 13:20, Manoj Awasthi wrote:
On Wed, Feb 10, 2016 at 5:20 PM, Steve Loughran wrote:
On 10 Feb 2016, at 04:42, praveen S wrote:
Hi there,
I am looking at the SparkSQL setting spark.sql.autoBroadcastJoinThreshold.
According to the programming guide
*Note that currently statistics are only supported for Hive Metastore
tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS
noscan has been run.*
My question is that is
My apologies for writing that "there is no AM". I realize it now! :-) :-)
On Wed, Feb 10, 2016 at 7:14 PM, Steve Loughran wrote:
>
> On 10 Feb 2016, at 13:20, Manoj Awasthi wrote:
>
>
>
> On Wed, Feb 10, 2016 at 5:20 PM, Steve Loughran
Hi All,
I apologize for reposting; I wonder if anyone can explain this behavior,
and what would be the best way to resolve it without introducing
something like Kafka into the mix.
I basically have a Logstash instance, and would like to stream the output of
Logstash to Spark Streaming without
On 10 Feb 2016, at 14:18, Manoj Awasthi wrote:
My apologies for writing that "there is no AM". I realize it now! :-) :-)
There is the unmanaged AM option, which was originally written for debugging,
but has been used in various apps.
Spark
Here's a wild guess; it might be the fact that your first command uses tail
-f, so it doesn't close the input file handle when it hits the end of the
available bytes, while your second use of nc does this. If so, the last few
lines might be stuck in a buffer waiting to be forwarded. If so, Spark
>
> My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE
> compute statistics" command in Hive shell, is the statistics
> going to be used by SparkSQL to decide broadcast join?
Yes, Spark SQL will only accept the simple NOSCAN version. However, as
long as the sizeInBytes
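For reference, a minimal sketch of the flow being discussed (Spark 1.5/1.6-era HiveContext; the table names and join are hypothetical, and sc is an existing SparkContext):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Populate table-level statistics (sizeInBytes) in the Hive metastore.
sqlContext.sql("ANALYZE TABLE dim_pages COMPUTE STATISTICS noscan")

// Tables smaller than this many bytes are broadcast to all worker nodes when
// joined; set the value to -1 to disable broadcast joins.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

val joined = sqlContext.sql(
  "SELECT f.*, d.url FROM fact_views f JOIN dim_pages d ON f.page_id = d.page_id")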
Hello community,
I would like to introduce a new Spark package that should
be useful for Python users who depend on scikit-learn.
Among other tools, it lets you:
- train and evaluate multiple scikit-learn models in parallel
- convert Spark's DataFrames seamlessly into numpy arrays
- (experimental)
Hi,
I am trying some basic integration and was going through the manual.
I would like to read from a topic, and get a JavaReceiverInputDStream
for messages in that topic. However, the example uses
JavaPairReceiverInputDStream<>. How do I get a stream for only a single
topic in Java?
Reference
If you are using systemd, you will need to specify the limit in the service
file. I had run into this problem and discovered the solution from the
following references (see the sketch after the list):
* https://bugzilla.redhat.com/show_bug.cgi?id=754285#c1
* http://serverfault.com/a/678861
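A minimal sketch of the change those references describe, assuming the limit in question is the open-file limit (nofile) and a hypothetical unit file name:

# /etc/systemd/system/spark-worker.service (hypothetical unit name)
[Service]
# systemd-managed services ignore /etc/security/limits.conf, so raise the
# open-file limit here instead.
LimitNOFILE=65536

After editing the unit file, run systemctl daemon-reload and restart the service for the new limit to take effect.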
On Fri, Feb 5, 2016 at 1:18 PM, Nirav
Short answer: PySpark does not support UDAFs (user-defined aggregate
functions) for now.
On Tue, Feb 9, 2016 at 11:44 PM, Viktor ARDELEAN wrote:
> Hello,
>
> I am using following transformations on RDD:
>
> rddAgg = df.map(lambda l: (Row(a = l.a, b= l.b, c = l.c), l))\
>
Michael,
Thanks for the reply.
On Wed, Feb 10, 2016 at 11:44 AM, Michael Armbrust wrote:
> My question is that is "NOSCAN" option a must? If I execute "ANALYZE TABLE
>> compute statistics" command in Hive shell, is the statistics
>> going to be used by SparkSQL to
I am using Spark 1.6.0 and Java. I created a cluster using spark-ec2. I am
having a heck of a time figuring out how to write from my streaming app to AWS
S3. I should mention I have never used S3 before and am not sure it is set
up correctly.
org.apache.hadoop.fs.s3.S3Exception:
Hi,
I have a bunch of files stored in the HDFS /unix_files directory, 319 files
in total.
scala> val errlog = sc.textFile("/unix_files/*.ksh")
scala> errlog.filter(line => line.contains("sed"))count()
res104: Long = 1113
So it returns 1113 instances of the word "sed".
If I want to see the collection
It's a pair because there's a key and value for each message.
If you just want a single topic, put a single topic in the map of topic ->
number of partitions.
See
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
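The linked example is Java, but the idea is the same in any language; here is a minimal Scala sketch with hypothetical topic, ZooKeeper, and consumer-group names (it assumes the spark-streaming-kafka artifact is on the classpath):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("single-topic-demo")
val ssc = new StreamingContext(conf, Seconds(2))

// One entry in the map: the topic name and the number of receiver threads for it.
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("mytopic" -> 1))

// Each element is a (key, message) pair; keep only the message payload.
val messages = stream.map(_._2)
messages.print()

ssc.start()
ssc.awaitTermination()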
On
Hi,
We want to add the spark-solr repo (https://github.com/LucidWorks/spark-solr)
to spark-packages.org, but it is currently failing due to "Cannot find
README.md" (http://spark-packages.org/staging?id=882).
We use adoc for our internal and external documentation and we are
wondering if
You can't. The number of cores must be greater than the number of receivers.
On Wed, Feb 10, 2016 at 2:34 AM, ajay garg wrote:
> Hi All,
> I am running 3 executors in my spark streaming application with 3
> cores per executors. I have written my custom receiver
What kind of steps exist when reading the ORC format in Spark SQL?
I mean, reading a CSV file is usually just reading the dataset directly into
memory.
But I feel like Spark SQL has some extra steps when reading the ORC format.
For example, does it have to create a table to insert the dataset? And then does it
insert the
Hi Mich,
If you would like to print everything to the console, you could:
errlog.filter(line => line.contains("sed"))collect()foreach(println)
or you could always save to a file using any of the saveAs methods.
Thanks,
Chandeep
On Wed, Feb 10, 2016 at 8:14 PM, <
I'm working with Spark 1.5.0, and I'm using the Scala API to construct
DataFrames and perform operations on them. My application requires that I
synthesize column names for intermediate results under some circumstances,
and I don't know what the rules are for legal column names. In particular,
Hi Chandeep
Many thanks for your help
In the line below
errlog.filter(line => line.contains("sed"))collect()foreach(println)
Can you please clarify the components with the correct naming, as I am
new to Scala:
* errlog --> is the RDD?
* filter(line =>
Hi there,
I am trying to create a listener for my Spark job to do some additional
notifications for failures using this Scala API:
https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.scheduler.JobResult
.
My idea was to write something like this:
override def onJobEnd(jobEnd:
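Roughly along these lines; a sketch against the 1.x listener API (the notification call is a placeholder):

import org.apache.spark.scheduler.{JobFailed, SparkListener, SparkListenerJobEnd}

class JobFailureNotifier extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    jobEnd.jobResult match {
      case JobFailed(exception) =>
        // Placeholder: replace with your own alerting / notification call.
        println(s"Job ${jobEnd.jobId} failed: ${exception.getMessage}")
      case _ => // JobSucceeded: nothing to do
    }
  }
}

// Register the listener on the SparkContext before running jobs:
// sc.addSparkListener(new JobFailureNotifier())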
bq. I followed something similar $"a.x"
Please use expr("...")
e.g. if your DataSet has two columns, you can write:
ds.select(expr("_2 / _1").as[Int])
where _1 refers to first column and _2 refers to second.
On Tue, Feb 9, 2016 at 3:31 PM, Raghava Mutharaju wrote:
What Partitioner do you use?
Have you tried using RangePartitioner?
Cheers
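If it helps, a minimal Scala sketch of repartitioning a pair RDD with RangePartitioner (the data and the count of 11 partitions are illustrative):

import org.apache.spark.RangePartitioner

val pairs = sc.parallelize(1 to 1000000).map(i => (i, i.toString))
// RangePartitioner samples the keys and builds roughly equal-sized key ranges,
// which can even out skew that a hash partitioner leaves behind.
val ranged = pairs.partitionBy(new RangePartitioner(11, pairs))

// Check how many records ended up in each partition.
ranged.mapPartitions(it => Iterator(it.size), preservesPartitioning = true).collect().foreach(println)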
On Wed, Feb 10, 2016 at 3:54 PM, daze5112 wrote:
> Hi im trying to improve the performance of some code im running but have
> noticed that my distribution of my RDD across executors isn't
Hi,
This is really weird. I checked my code and I only have a List[Boolean] of 7
items; the default behavior should be overwrite.
I even added overwrite after the column name in the SomeColumns definition, but
the result still shows List<<77>>, etc.
It seems that in some way it ignores overwrite and just
Hi,
//Loading all the DB Properties
val options1 = Map("url" ->
"jdbc:oracle:thin:@xx.xxx.xxx.xx:1521:dbname","user"->"username","password"->"password","dbtable"
-> "TESTCONDITIONS")
val testCond = sqlContext.load("jdbc",options1 )
val condval = testCond.select("Cond")
testCond.show()
val
Hi Spark Users,
I'm running Spark jobs on Mesos, and sometimes I get a vast number of Task
Scheduler errors:
ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 1161
because its task set is gone (this is likely the result of receiving
duplicate task finished status updates)
It
Hi Nirav,
I faced a similar issue with YARN, EMR 1.5.2, and the following
Spark conf helped me. You can set the values accordingly:
conf= (SparkConf().set("spark.master","yarn-client").setAppName("HalfWay"
).set("spark.driver.memory", "15G").set("spark.yarn.am.memory","15G"))
Hello All,
I am planning on taking the Spark certification and I was wondering if one has
to be well equipped with MLlib & GraphX as well or not?
Please advise.
Thanks
How are you submitting/running the job - via spark-submit or as a plain old
Java program?
If you are using spark-submit, you can control the memory setting via the
configuration parameter spark.executor.memory in spark-defaults.conf.
If you are running it as a Java program, use -Xmx to set the
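For example (the class names, jar names, and heap sizes below are illustrative):

# Via spark-submit: set spark.executor.memory in spark-defaults.conf, e.g.
#   spark.executor.memory   4g
# or pass it on the command line:
spark-submit --conf spark.executor.memory=4g --class com.example.MyApp my-app.jar

# As a plain Java program running a local SparkContext: size the single JVM heap with -Xmx
java -Xmx4g -cp my-app.jar com.example.MyApp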
Hi, PySpark experts,
I'm trying to implement a naive Bayes lib with the same interface as
pyspark.mllib.classification.NaiveBayes; train() and predict() will be the
interfaces.
I finished train(LabeledPoint), but ran into trouble in predict() due
to the SPARK-5063 issue.
*Exception: It appears that
Hello,
I want to add a new String column to the dataframe, based on an existing
column's values:
from pyspark.sql.functions import lit
df.withColumn('strReplaced', lit(df.str.replace("a", "b").replace("c", "d")))
So basically I want to add a new column named "strReplaced", that is
the same as the
Hi,
I am a newbie to Spark, using Spark version 1.5.2.
I am trying to connect to an Oracle DB using the Spark API, and getting errors.
Steps I followed:
Step 1 - I placed the ojdbc6.jar in
/usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar
Step 2 - Registered the jar file
Hi Divya,
You need to install the Oracle JDBC driver on the cluster, into the lib folder.
> On 10/02/2016, at 09:37, Divya Gehlot wrote:
>
> oracle.jdbc.driver.OracleDrive
Hi Ajay,
Have you overridden the Receiver#preferredLocation method in your custom
Receiver? You can specify a hostname for your Receiver. Check
ReceiverSchedulingPolicy#scheduleReceivers; it should honor your
preferredLocation value for Receiver scheduling.
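A minimal sketch of such a receiver (the hostname is illustrative and must match an executor host):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class PinnedReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Ask the scheduling policy to place this receiver on a specific executor host.
  override def preferredLocation: Option[String] = Some("worker-node-1")

  def onStart(): Unit = {
    // Open the connection on a background thread and call store(...) per record.
  }

  def onStop(): Unit = {
    // Close the connection.
  }
}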
On Wed, Feb 10, 2016 at 4:04 PM, ajay
Have you tried adding hbase client jars to spark.executor.extraClassPath ?
Cheers
On Wed, Feb 10, 2016 at 12:17 AM, Prabhu Joseph wrote:
> + Spark-Dev
>
> For a Spark job on YARN accessing hbase table, added all hbase client jars
> into spark.yarn.dist.files,
Hi All,
I am running 3 executors in my Spark Streaming application, with 3
cores per executor. I have written my custom receiver for receiving network
data.
In my current configuration I am launching 3 receivers, one receiver per
executor.
If 2 of my executors die during the run, I am left
Hi Viktor,
Try to create a UDF. It's quite simple!
Ardo.
> On 10 Feb 2016, at 10:34, Viktor ARDELEAN wrote:
>
> Hello,
>
> I want to add a new String column to the dataframe based on an existing
> column values:
>
> from pyspark.sql.functions import lit
>
AFAIK sc.addJar() will add the jars to the executors' classpath. The
datasource resolution (createRelation) happens on the driver side, and the driver
classpath should contain ojdbc6.jar. You can use the
"spark.driver.extraClassPath"
config parameter to set this.
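For example, reusing the path mentioned earlier in the thread (the application class and jar names are illustrative):

spark-submit \
  --conf spark.driver.extraClassPath=/usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar \
  --jars /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar \
  --class com.example.OracleLoad my-app.jar
# (--driver-class-path is the equivalent spark-submit command-line flag)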
On Wed, Feb 10, 2016 at 3:08 PM, Jorge
Hi
I work with PySpark & Spark 1.5.2.
Currently, saving an RDD to a CSV file is very slow and uses only 2% CPU.
I use:
my_dd.write.format("com.databricks.spark.csv").option("header",
"false").save('file:///my_folder')
Is there a way to save the CSV faster?
Many thanks
Yes Ted, spark.executor.extraClassPath will work if the HBase client jars are
present on all Spark Worker / NodeManager machines.
spark.yarn.dist.files is the easier way, as the HBase client jars can be copied
from the driver machine or HDFS into the container / Spark executor classpath
automatically. No need to
+ Spark-Dev
For a Spark job on YARN accessing an HBase table, I added all HBase client jars
to spark.yarn.dist.files. The NodeManager, when launching the container (i.e. the
executor), does localization and brings all hbase-client jars into the executor's
CWD, but the executor tasks still fail with ClassNotFoundException
On 10 Feb 2016, at 04:42, praveen S wrote:
Hi,
I have 2 questions when running Spark jobs on YARN in client mode:
1) Where is the AM (application master) created:
in the cluster
A) Is it created on the client where the job was
> On 10 Feb 2016, at 10:56, Eli Super wrote:
>
> Hi
>
> I work with pyspark & spark 1.5.2
>
> Currently saving rdd into csv file is very very slow , uses 2% CPU only
>
> I use :
> my_dd.write.format("com.databricks.spark.csv").option("header",
>
Hi,
The writes, in terms of the number of records written simultaneously, can be
increased if you increase the number of partitions. You can try to
increase the number of partitions and check how it works. There is,
though, an upper cap (the one that I faced on Ubuntu) on the number of
parallel
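As a sketch (Scala shown, but the PySpark call has the same shape; df stands in for the DataFrame being written and the count of 32 partitions is illustrative):

// More partitions means more tasks writing in parallel.
val out = df.repartition(32)
out.write
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .save("file:///my_folder")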
I have no problems when submitting the task using spark-submit. The --jars
option with the list of jars required is successful and I see in the output
the jars being added:
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/usr/lib/spark/extras/lib/spark-streaming-kafka.jar at
I figured it out.
Here is how it's done:
from pyspark.sql.functions import udf
replaceFunction = udf(lambda columnValue : columnValue.replace("\n", "
").replace('\r', " "))
df.withColumn('strReplaced', replaceFunction(df["str"]))
On 10 February 2016 at 13:04, wrote:
> Hi
On Wed, Feb 10, 2016 at 5:20 PM, Steve Loughran wrote:
>
> On 10 Feb 2016, at 04:42, praveen S wrote:
>
> Hi,
>
> I have 2 questions when running the spark jobs on yarn in client mode :
>
> 1) Where is the AM(application master) created :
>
>
> in
Why not use the save method from the RandomForestModel class to save a model at
a specified path?
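A minimal sketch (the path is hypothetical; model is the trained RandomForestModel and sc the SparkContext):

import org.apache.spark.mllib.tree.model.RandomForestModel

// Persist the trained model.
model.save(sc, "hdfs:///models/my-random-forest")
// Load it back later, e.g. in another Spark application.
val restored = RandomForestModel.load(sc, "hdfs:///models/my-random-forest")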
Mohammed
Author: Big Data Analytics with Spark
-Original Message-
From: jluan [mailto:jaylu...@gmail.com]
Sent: Wednesday, February 10, 2016 5:57 PM
To: user@spark.apache.org
Thanks for the reply. I'd like to export the decision splits for each tree
out to an external file which is read elsewhere, not using Spark. As far as
I know, saving a model to a path will save a bunch of binary files which
can be loaded back into Spark. Is this correct?
On Feb 10, 2016 7:21 PM,
Thanks a lot Ted.
If the two columns are of different types, say Int and Long, would it then be
ds.select(expr("_2 / _1").as[(Int, Long)])?
Regards,
Raghava.
On Wed, Feb 10, 2016 at 5:19 PM, Ted Yu wrote:
> bq. I followed something similar $"a.x"
>
> Please use expr("...")
>
Hi everyone, new to this list and Spark, so I'm hoping someone can point me
in the right direction.
I'm trying to perform this same sort of task:
http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
and I'm running into the same problem - it doesn't
Yes, a model saved with the save method can be read back only by the load
method in the RandomForestModel object.
Unfortunately, I don’t know any better mechanism for what you are trying to do.
There was a discussion on this topic a few days ago, so if you search the
mailing list archives, you
Mich:
When you execute the statements in Spark shell, you would see the types of
the intermediate results.
scala> val errlog = sc.textFile("/home/john/s.out")
errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out
MapPartitionsRDD[1] at textFile at <console>:24
scala> val sed = errlog.filter(line =>
Hi, I'm trying to improve the performance of some code I'm running, but I have
noticed that the distribution of my RDD across executors isn't exactly even
(see pic below). I'm using YARN and kicking it off with 11 executors. Not
sure how to get a more even spread or if this is normal. Thanks
Please see this thread:
http://search-hadoop.com/m/q3RTtvxWU21wl78x1=Re+Spark+job+submission+REST+API
On Wed, Feb 10, 2016 at 3:37 PM, Tracy Li wrote:
> Hi Spark Experts,
>
> I am new for spark and we have requirements to support spark job, jar, sql
> etc(submit, manage).
Hi Mich,
your assumptions 1 to 3 are all correct (nitpick: they're method
*calls*, the methods being the part before the parentheses, but I
assume that's what you meant). The last one is also a method call but
uses syntactic sugar on top: `foreach(println)` boils down to
`foreach(line =>
Hi Spark Experts,
I am new to Spark, and we have requirements to support Spark jobs, jars, SQL,
etc. (submit, manage).
So far I have not found any REST API bundled with Spark. There are some
third-party libs that already support this:
https://github.com/spark-jobserver/spark-jobserver
I want to confirm: is
Many thanks Jakob.
So it basically boils down to this demarcation, as suggested, which looks
clearer:
val errlog = sc.textFile("/unix_files/*.ksh")
errlog.filter(line => line.contains("sed")).collect().foreach(line =>
println(line))
Regards,
Mich
On 10/02/2016 23:21, Jakob Odersky wrote:
Exactly!
As a final note, `foreach` is also defined on RDDs. This means that
you don't need to `collect()` the results into an array (which could
give you an OutOfMemoryError in case the RDD is really really large)
before printing them.
Personally, when learning a new library, I like to look
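Concretely, the two variants side by side (note that in cluster mode the second form prints on the executors rather than in your local shell):

// Pulls every matching line back to the driver before printing:
errlog.filter(line => line.contains("sed")).collect().foreach(println)
// Runs on the executors, with no driver-side array to overflow:
errlog.filter(line => line.contains("sed")).foreach(println)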
In YARN we have the following settings enabled so that a job can use virtual
memory for capacity beyond physical memory, of course:
yarn.nodemanager.vmem-check-enabled = false
yarn.nodemanager.pmem-check-enabled = false
The vmem to pmem ratio is 2:1. However Spark
Thanks Ted.
That means we have to use either Livy or job-server if we want to go with a
REST API.
Thanks,
Tracy
On Wed, Feb 10, 2016 at 4:13 PM, Ted Yu wrote:
> Please see this thread:
>
>
> http://search-hadoop.com/m/q3RTtvxWU21wl78x1=Re+Spark+job+submission+REST+API
>
>
Another alternative:
rdd.take(1000).drop(100) //this also preserves ordering
Note however that this can lead to an OOM if the data you're taking is
too large. If you want to perform some operation sequentially on your
driver and don't care about performance, you could do something
similar as
We have been trying to solve a memory issue with a Spark job that processes
150GB of data (on disk). It does a groupBy operation; some of the executors
will receive somewhere around 2-4M Scala case objects to work with. We
are using the following Spark config:
"executorInstances": "15",
I've trained a RandomForest classifier and I can print my model's decisions
using model.toDebugString.
However, I was wondering if there's a way to extract the trees programmatically,
by traversing the nodes or in some other way, such that I can write my own
decision file rather than just a debug
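For what it's worth, the MLlib tree model exposes its structure, so a sketch of walking it yourself could look like this (model is the trained RandomForestModel; the output format is entirely up to you, and categorical splits would use split.categories rather than threshold):

import org.apache.spark.mllib.tree.model.Node

// Recursively walk one decision tree and emit its splits as text.
def describe(node: Node, depth: Int = 0): Seq[String] = {
  val indent = "  " * depth
  if (node.isLeaf) {
    Seq(s"$indent leaf: predict=${node.predict.predict}")
  } else {
    val split = node.split.get
    Seq(s"$indent node ${node.id}: feature=${split.feature}, threshold=${split.threshold}") ++
      node.leftNode.map(describe(_, depth + 1)).getOrElse(Nil) ++
      node.rightNode.map(describe(_, depth + 1)).getOrElse(Nil)
  }
}

// Each tree in the ensemble exposes its root via topNode.
val lines = model.trees.zipWithIndex.flatMap { case (tree, i) =>
  s"tree $i (${tree.numNodes} nodes)" +: describe(tree.topNode)
}
// Write `lines` out with plain java.io, or sc.parallelize(lines).saveAsTextFile(...),
// independent of the binary format that model.save produces.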