This could be the workaround:
http://stackoverflow.com/a/36419857
On Tue, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" wrote:
Thanks Yanbo and Felix.
I tried these commands on CDH Quickstart VM and also on "Spark 1.6 pre-built for
Thanks, that makes sense. Can anyone answer the question below?
http://apache-spark-user-list.1001560.n3.nabble.com/spark-parquet-too-many-small-files-td27264.html
Thanks
Sri
On Tue, Jul 5, 2016 at 8:15 PM, Saisai Shao wrote:
> It does not work to configure local dirs to
It does not work to configure local dirs on HDFS. Local dirs are mainly
used for data spill and shuffle data persistence, so HDFS is not suitable for
them. If you hit a capacity problem, you can configure multiple dirs
located on different mounted disks.
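For example, a minimal sketch of that kind of configuration (the mount points
/data1 and /data2 are made up):

  // spark.local.dir takes a comma-separated list of local directories;
  // /data1 and /data2 are hypothetical mount points on separate disks.
  val conf = new org.apache.spark.SparkConf()
    .setAppName("MyApp")
    .set("spark.local.dir", "/data1/spark_tmp,/data2/spark_tmp")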
On Wed, Jul 6, 2016 at 9:05 AM, Sri
Hi,
It's a space issue. We are currently using /tmp, and at the moment we don't have
any mounted location set up yet.
Thanks
Sri
Sent from my iPhone
> On 5 Jul 2016, at 17:22, Jeff Zhang wrote:
>
> Any reason why you want to set this on HDFS?
>
>> On Tue, Jul 5, 2016 at 3:47
Any reason why you want to set this on HDFS?
On Tue, Jul 5, 2016 at 3:47 PM, kali.tumm...@gmail.com <kali.tumm...@gmail.com> wrote:
> Hi All,
>
> Can I set spark.local.dir to an HDFS location instead of the /tmp folder?
>
> I tried setting up the temp folder on HDFS but it didn't work. Can
>
Hi Alexander,
I used the same example from the MLP user guide, but in Java.
I modified the example a little bit, and there was a nasty bug that I hadn't
noticed (an inconsistency between the layer sizes and the real feature count).
After fixing that, MLP works in my test.
So the oversight was mine.
Hi All,
Can I set spark.local.dir to an HDFS location instead of the /tmp folder?
I tried setting up the temp folder on HDFS but it didn't work. Can
spark.local.dir write to HDFS?
.set("spark.local.dir","hdfs://namednode/spark_tmp/")
16/07/05 15:35:47 ERROR DiskBlockManager: Failed to create local
Just to be clear, Spark 2.0 isn't released yet; there is a preview version
for developers to explore and test compatibility with. That being said, Roy
Hasson has a blog post discussing using Spark 2.0-preview with EMR -
Hi Biplob,
The current Streaming KMeans code only updates the model with data that comes in
through training (e.g. trainOn); predictOn does not update the model.
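Roughly, the pattern looks like this (a sketch, assuming an existing
StreamingContext `ssc` and made-up input paths):

  import org.apache.spark.mllib.clustering.StreamingKMeans
  import org.apache.spark.mllib.linalg.Vectors

  // hypothetical directories with vectors in Vectors.parse format, e.g. "[1.0,2.0,3.0]"
  val trainingStream = ssc.textFileStream("/train/dir").map(Vectors.parse)
  val testStream = ssc.textFileStream("/test/dir").map(Vectors.parse)

  val model = new StreamingKMeans()
    .setK(2)
    .setDecayFactor(1.0)
    .setRandomCenters(3, 0.0)

  model.trainOn(trainingStream)        // only trainOn updates the cluster centers
  model.predictOn(testStream).print()  // predictOn scores with the current model, no update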
Cheers,
Holden :)
P.S.
Traffic on the list might have been a bit slower right now because of the
Canada Day and 4th of July weekends, respectively.
I recently got a sales email from SnappyData, and after reading the
documentation about what they offer, it sounds very similar to what Structured
Streaming will offer, without the underlying in-memory, spill-to-disk,
CRUD-compliant data storage in SnappyData. I was wondering if Structured Streaming
Well, that is what the OP stated:
"I have a spark cluster consisting of 4 nodes in a standalone mode, ..."
HTH
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Hi Mikhail,
I have followed the MLP user guide and used the dataset and network
configuration you mentioned. MLP was trained without any issues with the default
parameters, that is, a block size of 128 and 100 iterations.
Source code:
scala> import
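(The snippet above is cut off; for reference, a rough reconstruction of the
user-guide MLP example in Scala, using the standard sample data path:)

  import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

  // layer sizes must match the input feature count (4) and the number of classes (3)
  val data = sqlContext.read.format("libsvm")
    .load("data/mllib/sample_multiclass_classification_data.txt")
  val layers = Array[Int](4, 5, 4, 3)

  val trainer = new MultilayerPerceptronClassifier()
    .setLayers(layers)
    .setBlockSize(128)
    .setMaxIter(100)
    .setSeed(1234L)

  val model = trainer.fit(data)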
Did the OP say he was running a standalone cluster of Spark, or on YARN?
> On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh wrote:
>
> Hi Jakub,
>
> Any reason why you are running in standalone mode, given that you are
> familiar with YARN?
>
> In theory your
Hi,
Yes, I tried that. However, I also want to "pin down" the specific event
containing invalid characters in the column names (per the Parquet spec)
and drop it from the DataFrame before converting it to Parquet.
Where I'm having trouble is that my DataFrame might have events with different
sets of fields,
Test by producing messages into Kafka at a rate comparable to what you
expect in production.
Test with backpressure turned on; it doesn't require you to specify a
fixed limit on the number of messages and will do its best to maintain
batch timing. Or you could empirically determine a reasonable
Hi, thanks.
I know about the possibility of limiting the number of messages. But the problem
is I don't know the number of messages the system is able to process; it
depends on the data. The example is very simple. I need a strict response after a
specified time, something like soft real time. In the case of Flink
Hi,
Try doing multiple filters to keep the data that you need.
Regards,
On Tue, Jul 5, 2016 at 5:38 PM, pseudo oduesp wrote:
> Hi ,
> how can I remove rows from a DataFrame based on some condition on some
> columns?
> thanks
>
--
M'BAREK Med Nihed,
Fedora Ambassador,
Hi,
how can I remove rows from a DataFrame based on some condition on some
columns?
thanks
From experience, here are the kinds of things that cause the driver to run out
of memory:
- Way too many partitions (1 and up)
- Something like this:
data = load_large_data()
rdd = sc.parallelize(data)
- Any call to rdd.collect() or rdd.take(N) where the resulting data is
bigger than the driver's memory
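As an illustration of the second point, a sketch contrasting the two patterns
(paths and parsing are made up):

  // Antipattern: the whole dataset is materialized in the driver before being parallelized.
  // val data = loadLargeData()            // hypothetical driver-side load
  // val rdd = sc.parallelize(data)

  // Safer: let the executors read directly from distributed storage.
  val rdd = sc.textFile("hdfs:///some/large/input")   // hypothetical path
  val parsed = rdd.map(_.split(","))

  // And prefer bounded actions over collect() when inspecting results:
  parsed.take(10).foreach(row => println(row.mkString(",")))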
Hi Jakub,
Any reason why you are running in standalone mode, given that you are
familiar with YARN?
In theory your settings are correct. I checked your environment tab
settings and they look correct.
I assume you have checked this link
http://spark.apache.org/docs/latest/spark-standalone.html
So now that we have clarified that everything is submitted in standalone cluster
mode, what is left is that the application (ML pipeline) doesn't take advantage
of the full cluster's power but essentially runs just on the master node until
resources are exhausted. Why doesn't training the ML Decision Tree scale to the
rest
If you're talking about limiting the number of messages per batch to
try and keep from exceeding batch time, see
http://spark.apache.org/docs/latest/configuration.html
look for backpressure and maxRatePerPartition
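For reference, a sketch of the two settings mentioned (values are illustrative
only):

  val conf = new org.apache.spark.SparkConf()
    // let Spark adapt the ingestion rate to keep batch times stable
    .set("spark.streaming.backpressure.enabled", "true")
    // hard upper bound per Kafka partition per second for the direct stream
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")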
But if you're only seeing zeros after your job runs for a minute, it
sounds like
Hello,
I'm trying to organize processing of messages from Kafka. There is a
typical case where the number of messages in Kafka's queue is more than the
Spark app is able to process. But I need a strict time limit to
prepare a result for at least a part of the data.
Code example:
If it's a batch job, don't use a stream.
You have to store the offsets reliably somewhere regardless. So it sounds
like your only issue is with identifying offsets per partition? Look at
KafkaCluster.scala, methods getEarliestLeaderOffsets /
getLatestLeaderOffsets.
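A rough sketch of the batch-read side with explicit offsets (broker address,
topic name and offset values are placeholders):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
  // one OffsetRange per topic partition: (topic, partition, fromOffset, untilOffset)
  val offsetRanges = Array(
    OffsetRange("mytopic", 0, 0L, 1000L),
    OffsetRange("mytopic", 1, 0L, 1000L)
  )

  val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges)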
On Tue, Jul 5, 2016 at 7:40
On Tue, Jul 5, 2016 at 4:18 PM, Jakub Stransky wrote:
> 1) Is it possible to configure multiple executors per worker machine?
Yes.
> Do I understand it correctly that I specify SPARK_WORKER_MEMORY and
> SPARK_WORKER_CORES which essentially describes available resources
Hello,
I went through the Spark documentation and several posts from Cloudera etc., and
as my background is heavily in Hadoop/YARN there is still a little confusion.
Could someone more experienced please clarify?
What I am trying to achieve:
- Running cluster in standalone mode version 1.6.1
After using Spark 1.2 for quite a long time, I have realised that you can
no longer pass Spark configuration to the driver via --conf on the command
line (or, in my case, a shell script).
I am thinking about using system properties and picking the config up using
the following bit of code:
def
Hi,
What do you think of using df.columns to get the column names and
process them appropriately, or df.schema?
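For instance, a sketch of dropping columns whose names Parquet rejects
(assuming an existing DataFrame `df`; the character set comes from the usual
Spark/Parquet error message):

  // Characters that Parquet column names may not contain: " ,;{}()\n\t="
  val invalid = " ,;{}()\n\t=".toSet
  val badCols = df.columns.filter(_.exists(invalid.contains))
  // DataFrame.drop takes a single column name in 1.6, so drop them one by one.
  val cleaned = badCols.foldLeft(df)((d, c) => d.drop(c))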
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Tue,
Thanks for your answer. Unfortunately I'm bound to Kafka 0.8.2.1. --Bruckwald
nihed mbarek wrote:
>Hi, Are you using a new version of Kafka? If yes, since 0.9 the auto.offset.reset
>parameter takes: earliest: automatically reset the offset to the earliest
>offset; latest: automatically
Thanks Yanbo and Felix.
I tried these commands on CDH Quickstart VM and also on "Spark 1.6
pre-built for Hadoop" version. I am still not able to get it working. Not
sure what I am missing. Attaching the logs.
On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung wrote:
Hi,
Are you using a new version of Kafka? If yes,
since 0.9 the auto.offset.reset parameter takes:
- earliest: automatically reset the offset to the earliest offset
- latest: automatically reset the offset to the latest offset
- none: throw an exception to the consumer if no previous offset is found
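For the Spark 1.6 direct stream, which still uses the 0.8-style consumer
properties, the equivalent values are "smallest"/"largest"; a sketch (the
broker address is a placeholder):

  val kafkaParams = Map(
    "metadata.broker.list" -> "broker1:9092",  // placeholder
    "auto.offset.reset"    -> "smallest"       // 0.8-style equivalent of "earliest"
  )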
Hello, I'm writing a Spark (v1.6.0) batch job which reads from a Kafka
topic.
For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD;
however, I need to set the offsets for all the partitions and also need to
store them somewhere (ZK? HDFS?) to know from where to start the next
Hi,
I have a question regarding ML algorithms.
What are the most network-intensive algorithms in Spark MLlib?
I have already looked at ALS (as pointed out here:
https://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html)
ALS is pretty communication and computation
Hi,
I implemented the streaming k-means example provided on the Spark website, but
in Java.
The full implementation is here:
http://pastebin.com/CJQfWNvk
But I am not getting anything in the output except occasional timestamps
like the one below:
---
Hi,
I need to sort a DataFrame and retrieve the bounds of each partition.
dataframe.sort() uses range partitioning in the physical plan.
I need to retrieve the partition bounds.
Many thanks for your help.
It all depends on your latency requirements and volume. 100s of queries per
minute, with an acceptable latency of up to a few seconds? Yes, you could
use Spark for serving, especially if you're smart about caching results
(and I don't mean just Spark caching, but caching recommendation results
for
Sean is correct - we now use jpmml-model (which is actually BSD 3-clause,
where old jpmml was A2L, but either work)
On Fri, 1 Jul 2016 at 21:40 Sean Owen wrote:
> (The more core JPMML libs are Apache 2; OpenScoring is AGPL. We use
> JPMML in Spark and couldn't otherwise
Hi all!
I'm having a strange issue with pyspark 1.6.1. I have a dataframe,
df = sqlContext.read.parquet('/path/to/data')
whose "df.take(10)" is really slow, apparently scanning the whole
dataset to take the first ten rows. "df.first()" works fast, as does
"df.rdd.take(10)".
I have found
You can try setting "spark.shuffle.manager" to "hash".
This is the meaning of the parameter:
Implementation to use for shuffling data. There are two implementations
available: sort and hash. Sort-based shuffle is more memory-efficient and is the
default option starting in 1.2.
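For example (a sketch; note that the hash shuffle manager only applies to
Spark 1.x, as it was removed in 2.0):

  val conf = new org.apache.spark.SparkConf()
    .set("spark.shuffle.manager", "hash")  // default is "sort" since 1.2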
--
On Tue, Jul 5, 2016 at 2:16 AM, kishore kumar wrote:
> 2016-07-04 05:11:53,972 [dispatcher-event-loop-0] ERROR
> org.apache.spark.scheduler.LiveListenerBus- Dropping SparkListenerEvent
> because no remaining room in event queue. This likely means one of the
Hi Deng,
Thanks for the help. Actually I need to pay more attention to memory usage.
I found the root cause of my problem. It seems that it was in the Spark
Streaming MQTTUtils module.
When I use "localhost" in the brokerURL, it doesn't work.
After changing it to "127.0.0.1", it works now.
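For reference, a minimal sketch of the working call (assuming an existing
StreamingContext `ssc`; the port and topic are placeholders):

  import org.apache.spark.streaming.mqtt.MQTTUtils

  // "localhost" failed to resolve for the receiver here; the loopback IP works.
  val inputDS = MQTTUtils.createStream(ssc, "tcp://127.0.0.1:1883", "monitor/topic")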
Thanks
Hi,
I'm trying to understand how Spark HA works. I'm using Spark 1.6.1 and
Zookeeper 3.4.6.
I've added the following line to $SPARK_HOME/conf/spark-env.sh
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
Hi Jared,
You can launch a Spark application even with just a single node in YARN,
provided that the node has enough resources to run the job.
It might also be good to note that when YARN calculates the memory
allocation for the driver and the executors, there is an additional memory
overhead
It should work:
spark-sql> create database somedb;
OK
Time taken: 2.694 seconds
spark-sql> show databases;
OK
accounts
asehadoop
default
iqhadoop
mytable_db
oraclehadoop
somedb
test
twitterdb
Time taken: 1.277 seconds, Fetched 9 row(s)
spark-sql> use somedb;
OK
Time taken: 0.059 seconds
Hi
I am very new to Spark SQL and I have a very basic question: how do I create
a database (or multiple databases) in Spark SQL? I am executing the SQL from the
spark-sql CLI. A query like in Hive, "create database sample_db", does not
work here. I have Hadoop 2.7 and Spark 1.6 installed on my system.
By setting preferSortMergeJoin to false, it still only picks between
merge join and broadcast join; it does not pick shuffle hash join depending on
autoBroadcastJoinThreshold's value.
I went through SparkStrategies, and it doesn't look like there is a direct,
clean way to enforce it.
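For reference, a sketch of the two configs under discussion (Spark 2.0-preview
style; exact join selection behaviour is version-dependent):

  // assuming a SparkSession named `spark`
  spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
  spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // -1 disables broadcast joins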
On Mon, Jul 4,
Hi guys,
I set up a pseudo-distributed Hadoop/YARN cluster on my laptop.
I wrote a simple Spark Streaming program as below to receive messages with
MQTTUtils.
conf = new SparkConf().setAppName("Monitor");
jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream inputDS =