Hi,
I get this error when trying to write Spark DataFrame to Hive Table Stored
as TextFile
sqlContext.sql("INSERT OVERWRITE TABLE analytics.client_view_stock SELECT *
FROM client_view_stock")
(analytics.client_view_stock is the Hive table; client_view_stock is the Spark
temp table.)
Error:
15/11/30 21:40:14 INFO latency: StatusCode=[404],
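For reference, a minimal sketch of that kind of write, assuming df is the
DataFrame being saved and sqlContext is a HiveContext (INSERT OVERWRITE into a
Hive table needs the Hive-enabled context; both names are placeholders):

// Hedged sketch: register the DataFrame as a temp table, then overwrite the
// Hive table from it. Table names are taken from the snippet above.
df.registerTempTable("client_view_stock")                  // Spark temp table
sqlContext.sql(
  "INSERT OVERWRITE TABLE analytics.client_view_stock " +  // Hive table
  "SELECT * FROM client_view_stock")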
Hi Eyal,
what you're seeing is not a Spark issue; it's related to boxed types.
I assume 'b' in your code is some kind of Java buffer, where b.getDouble()
returns an instance of java.lang.Double and not a scala.Double. Hence
muCouch is an Array[java.lang.Double], an array containing boxed
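A hedged sketch of the fix, reusing e and b from the original snippet: force
the element type to scala.Double so each boxed value gets unboxed (whether the
range should be "to" or "until" depends on how "features" is counted):

val muCouch: IndexedSeq[Double] =
  for (i <- 0 to e.getInt("features"))
    yield b.getDouble(i).doubleValue()  // unbox java.lang.Double -> scala.Double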
Hi Jacek,
To run a spark master on my windows box, I've created a .bat file with contents
something like:
.\bin\spark-class.cmd org.apache.spark.deploy.master.Master --host
For the worker:
.\bin\spark-class.cmd org.apache.spark.deploy.worker.Worker spark://:7077
To wrap these in services,
So I'm running PySpark 1.3.1 on Amazon EMR on a fairly beefy cluster (20 nodes
with 32 cores and 64 GB of memory each), and my parallelism,
executor.instances, executor.cores and executor memory settings are also
fairly reasonable (600, 20, 30, 48 GB).
However my job invariably fails
Hello,
I use Spark 1.4.1 and Hadoop 2.2.0.
It may be a stupid question, but I cannot understand why the Hadoop option
"dfs.blocksize" sometimes doesn't affect the number of blocks.
When I run the script below,
val BLOCK_SIZE = 1024 * 1024 * 512 // set to 512MB, hadoop default is 128MB
Hi,
Latency isn't as big an issue as it sounds. Did you try it out and fail some
performance metrics?
In short, the *Mesos* executor on a given slave is going to be long-running
(consuming memory, but no CPUs). Each Spark task will be scheduled using
Mesos CPU resources, but they don't suffer
On Fri, Nov 27, 2015 at 4:27 PM, Shuo Wang wrote:
> I am trying to use the start-master.sh script on windows 7.
From http://spark.apache.org/docs/latest/spark-standalone.html:
"Note: The launch scripts do not currently support Windows. To run a
Spark cluster on Windows,
Hi,
I've built a custom receiver and deployed an application using this
receiver to a cluster.
When I run the application locally I see the log output of my logger in
stdout/stderr, but when I run it on the cluster I don't see the log output
in stdout/stderr.
I just see the logs in the constructor
Hi,
Yes, that's possible -- I'm doing it every day in local and standalone modes.
Just use SPARK_PRINT_LAUNCH_COMMAND=1 before any Spark command, e.g.
spark-submit or spark-shell, to see the command used to start it:
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
SPARK_PRINT_LAUNCH_COMMAND
If you had exactly 1 message in the 0th topicpartition, to read it you
would use
OffsetRange("topicname", 0, 0, 1)
Kafka's simple shell consumer in that case would print
next offset = 1
So trying instead to consume
OffsetRange("topicname", 0, 1, 2)
shouldn't be expected to work
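For completeness, a hedged sketch of reading exactly that one message as a
batch RDD; the broker address in kafkaParams is a placeholder:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// Offsets are half-open: from 0 (inclusive) to 1 (exclusive) = one message.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val ranges = Array(OffsetRange("topicname", 0, 0, 1))
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)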
On Sat,
Then... something is wrong in my code ;), thanks.
2015-11-30 16:46 GMT+01:00 Cody Koeninger :
> Starting from the checkpoint using getOrCreate should be sufficient if all
> you need is at-least-once semantics
>
>
>
Hello,
I have Spark and Kafka with directStream. I'm trying to make sure that if
Spark dies it can process all those messages when it starts again. The offsets
are stored in checkpoints, but I don't know how to tell Spark to start from
that point.
I saw that there's another createDirectStream method with a
Hi, Lulian:
Are you sure that it'll be a long-running process in fine-grained mode? I
think you have a misunderstanding about it. An executor will be launched
for some tasks, but it is not a long-running process. When a group of tasks
finishes, it will be shut down.
On Mon, Nov 30, 2015 at 6:25 PM
Can you post the relevant code?
On Fri, Nov 27, 2015 at 4:25 AM, u...@moosheimer.com
wrote:
> Hi,
>
> we have some strange behavior with KafkaUtils DirectStream and the size of
> the MapPartitionsRDDs.
>
> We use a permanent direct stream where we consume about 8,500 JSON
>
Starting from the checkpoint using getOrCreate should be sufficient if all
you need is at-least-once semantics
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
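A minimal sketch of that pattern; sparkConf and checkpointDir are placeholders,
and the creating function must build the whole DStream graph and set the
checkpoint directory itself:

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // set up the direct stream and output operations here
  ssc
}

// Recovers from the checkpoint if one exists, otherwise builds a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()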
On Mon, Nov 30, 2015 at 9:38 AM, Guillermo Ortiz
wrote:
> Hello,
>
> I have
Have you seen this thread ?
http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+
which mentioned SPARK-11105
FYI
2015-11-30 9:00 GMT-08:00 Matthias Niehoff
:
> Hi,
>
> I've built a custom receiver and deployed an application
Pinging again ...
On Wed, Nov 25, 2015 at 4:19 PM, Ted Yu wrote:
> Which Spark release are you using ?
>
> Please take a look at:
> https://issues.apache.org/jira/browse/SPARK-5594
>
> Cheers
>
> On Wed, Nov 25, 2015 at 3:59 PM, Spark Newbie
>
Is there a way to get the Job Id for a query submitted via the Spark Thrift
Server? This would allow checking the status of that specific job via the
History Server.
Currently, I'm getting the status of all jobs and then filtering the results.
Looking for a more efficient approach.
Test environment
Hi,
Maven uses its own local repo, as does sbt. To cross the repo boundaries, use
the following:
resolvers += Resolver.mavenLocal
in your build.sbt or any other build definition as described in
http://www.scala-sbt.org/0.13/tutorial/Library-Dependencies.html#Resolvers.
You did it so let's give the
Hi,
My understanding of Spark on YARN and even Spark in general is very
limited, so keep that in mind.
I'm not sure why you compare yarn-cluster and Spark standalone. In
yarn-cluster the driver runs on a node in the YARN cluster, while Spark
standalone keeps the driver on the machine you launched a
Hi,
I'd call it a known issue on Windows, and have no solution other than using
SPARK_LOCAL_HOSTNAME or SPARK_LOCAL_IP before starting the PySpark shell to
*work around it*.
I wish I had access to Win7 to work on it longer and find a decent
solution (not a workaround).
If you have Scala REPL, execute
I am not an expert in Parquet.
Looking at PARQUET-166, it seems that parquet.block.size should be lower
than dfs.blocksize.
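As an illustration only, a hedged sketch of wiring those two settings together
when writing Parquet from Spark; df, the sizes and the output path are
placeholders:

// Keep the Parquet row-group size at or below the HDFS block size so a row
// group never spans two HDFS blocks.
sc.hadoopConfiguration.setLong("dfs.blocksize", 512L * 1024 * 1024)      // 512 MB
sc.hadoopConfiguration.setLong("parquet.block.size", 256L * 1024 * 1024) // 256 MB
df.write.parquet("/user/hive/warehouse/partition_test")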
Have you tried Spark 1.5.2 to see if the problem persists?
Cheers
On Mon, Nov 30, 2015 at 1:55 AM, Jung wrote:
> Hello,
> I use Spark 1.4.1 and Hadoop
Hi ,
I have a problem figuring out what the type bug is here.
I have this code fragment; it parses JSON into an Array[Double]:
val muCouch = {
  val e = input.filter( _.id == "mu" )(0).content()
  val b = e.getArray("feature_mean")
  for (i <- 0 to e.getInt("features"))
    yield
Hi Mark,
I said I've only managed to develop a limited understanding of how
Spark works in the different deploy modes ;-)
But somehow I thought that cluster mode in Spark standalone is not
supported. I think I've seen a JIRA with a change quite recently where
that was said, or something similar. Can't
Standalone mode also supports running the driver on a cluster node. See
"cluster" mode in
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
. Also,
http://spark.apache.org/docs/latest/spark-standalone.html#high-availability
On Mon, Nov 30, 2015 at 9:47 AM,
On Fri, Nov 27, 2015 at 12:12 PM, Nisrina Luthfiyati <
nisrina.luthfiy...@gmail.com> wrote:
> Hi all,
> I'm trying to understand how yarn-client mode works and found these two
> diagrams:
> [two diagrams omitted]
> In the first diagram, it looks like the driver running in the client
> directly communicates with
AFAIK the ContextCleaner should perform all of the cleaning *as long as
garbage collection is performed frequently enough on the driver*. See
https://issues.apache.org/jira/browse/SPARK-7689 and
https://github.com/apache/spark/pull/6220#issuecomment-102950055 for
discussion of this technicality.
This looks interesting, thanks Ruslan. But compaction with Hive is as
simple as an INSERT OVERWRITE statement, since Hive supports
CombineFileInputFormat; is it possible to do the same with Spark?
On Thu, Nov 26, 2015 at 9:47 AM, Ruslan Dautkhanov
wrote:
> An interesting
Hi Cody,
What if the offsets that are tracked are not present in Kafka? How do I
skip those offsets and go to the next offset? Also, would specifying
rebalance.backoff.ms be of any help?
Thanks,
Swteha
On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger wrote:
> To be blunt, if
You'd need to get the earliest or latest available offsets from Kafka,
whichever is most appropriate for your situation.
The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can
try adjusting that to get a longer sleep before retrying the task.
On Mon, Nov 30, 2015 at 1:50 PM,
Hi
I have a JavaPairRDD rdd1. I want to group rdd1 by key but preserve the
partitions of the original RDD, to avoid a shuffle, since I know all records
with the same key are already in the same partition.
The PairRDD is basically constructed using the Kafka streaming low-level
consumer, which has all records with
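A hedged Scala sketch of grouping inside each partition without a shuffle,
assuming rdd1 is a pair RDD of strings (adapt the types for the JavaPairRDD
case):

import org.apache.spark.rdd.RDD

// Group values by key within each partition only; this relies on every record
// for a given key already living in the same partition.
val grouped: RDD[(String, Seq[String])] = rdd1.mapPartitions(
  (iter: Iterator[(String, String)]) =>
    iter.toSeq.groupBy(_._1).mapValues(_.map(_._2)).iterator,
  preservesPartitioning = true)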
Please see the following, which went into Hive 1.1:
HIVE-8839 Support "alter table .. add/replace columns cascade"
FYI
On Mon, Nov 30, 2015 at 2:14 PM, Daniel Lopes
wrote:
> Hi,
>
> I get this error when trying to write Spark DataFrame to Hive Table Stored
> as TextFile
It should work with 1.5+.
On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote:
>
> Hi folks,
>
> Does anyone know whether the Grid Search capability has been enabled since
> issue SPARK-9011 in version 1.4.0? I'm getting the "rawPredictionCol
> column doesn't exist" error when
Unless it's a network camera with the ability to request specific frame
numbers for reads, the answer is that you will just read from the camera like
you normally would without Spark, inside of a foreachRDD(), and parallelize
the result out for processing once you have it in a collection in the
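A heavily hedged sketch of that approach; grabFrames() and processFrame() are
hypothetical placeholders for the ordinary (non-Spark) camera read and the
per-frame work, not real APIs:

val frames: Seq[Array[Byte]] = grabFrames()               // plain camera read on the driver
val processed = sc.parallelize(frames).map(processFrame)  // distribute the per-frame work
processed.count()                                         // or save / collect the results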
Is there any mechanism in the Kafka streaming source to specify the exact
partition id that we want a streaming job to consume from?
If not, is there a workaround besides writing our own custom receiver?
Thanks,
- Alan
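One hedged workaround sketch: the createDirectStream overload that takes a
fromOffsets map lets you list exactly the topic-partitions to consume. Broker
address, topic name, partition id and starting offset below are placeholders:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val fromOffsets = Map(TopicAndPartition("mytopic", 3) -> 0L)  // only partition 3
val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))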
Hi,
I am trying to use predicate pushdown in ORC and I was expecting that it
would be used when one tries to query an ORC table in Hive. But based on
the query plan generated, I don't see that happening. Am I missing some
configuration, or is this not supported at all?
PS: I have tried the
This is related:
SPARK-11087
FYI
On Mon, Nov 30, 2015 at 5:56 PM, Tejas Patil
wrote:
> Hi,
> I am trying to use predicate pushdown in ORC and I was expecting that it
> would be used when one tries to query an ORC table in Hive. But based on
> the query plan generated,
Hi Joseph,
Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a
"rawPredictionCol field does not exist" exception on Spark 1.5.2 for the
Gradient Boosting Trees classifier.
Ardo
On Tue, 1 Dec 2015 at 01:34, Joseph Bradley wrote:
> It should work with
Yes, I can reproduce it in Spark 1.5.2.
These are the results.
1. First case (1 block)
221.1 M
/user/hive/warehouse/partition_test/part-r-0-b0e5ecd3-75a3-4c92-94ec-59353d08067a.gz.parquet
221.1 M
Hi,
So, our Streaming Job fails with the following errors. If you see the errors
below, they are all related to Kafka losing offsets and
OffsetOutOfRangeException.
What are the options we have other than fixing Kafka? We would like to do
something like the following. How can we achieve 1 and 2
So, our Streaming Job fails with the following errors. If you see the
errors(highlighted in blue below), they are all related to Kafka losing
offsets and OffsetOutOfRangeException.
What are the options we have other than fixing Kafka? We would like to do
something like the following. How can we
Hello! Can anyone please guide me on how to capture video from a camera with
Spark Streaming? Any article or book to recommend?
Thank you.
Hi,
I am reading data from Kafka into Spark. It runs fine for some time but
then hangs forever with the following output. I don't see any errors in the
logs. How do I debug this?
2015-12-01 06:04:30,697 [dag-scheduler-event-loop] INFO (Logging.scala:59)
- Adding task set 19.0 with 4 tasks
2015-12-01
You could use the number of input files to determine the number of output
partitions. This assumes your input file sizes are deterministic.
Otherwise, you could also persist the RDD and then determine its size using
the APIs; see the sketch below.
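Two hedged sketches of those options; rdd, the input path and the 128 MB
target partition size are placeholders:

import org.apache.hadoop.fs.{FileSystem, Path}

// Option 1: one output partition per input file.
val fs = FileSystem.get(sc.hadoopConfiguration)
val numInputFiles = fs.listStatus(new Path("/data/input")).count(_.isFile)
val sized = rdd.coalesce(numInputFiles)

// Option 2: persist and materialize the RDD, read its cached size from the
// storage info, then aim for roughly 128 MB per partition.
rdd.persist().count()
val memBytes = sc.getRDDStorageInfo.find(_.id == rdd.id).map(_.memSize).getOrElse(0L)
val numParts = math.max(1, (memBytes / (128L * 1024 * 1024)).toInt)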
Regards
Sab
On 26-Nov-2015 11:13 pm, "Nezih Yigitbasi"
Hi, I was running the SparkPi example code and studying its performance
difference using different numbers of cores per worker. I change the number
of cores by using start-slave.sh -c CORES on the worker machine for
distributed computation. I also use spark-submit --master local[CORES] for
the
Hi Ndjido,
This is because GBTClassifier doesn't yet have a rawPredictionCol like the
RandomForestClassifier has.
Cf:
http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" wrote:
> Hi Joseph,
>
> Yes
Hi Benjamin,
Thanks, the documentation you sent is clear.
Is there any other way to perform a Grid Search with GBT?
Ndjido
On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet
wrote:
> Hi Ndjido,
>
> This is because GBTClassifier doesn't yet have a rawPredictionCol like
>
Thanks, TD!
I think this should serve my purpose.
On Sun, Nov 29, 2015 at 6:17 PM, Tathagata Das wrote:
> You can get the batch start time (the expected time, not the exact time when
> the jobs are submitted) from the DStream operation "transform". There is a
> version of transform
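For reference, a hedged sketch of that transform variant; lines is a
placeholder DStream, and each record is tagged with its batch time:

val tagged = lines.transform { (rdd, batchTime) =>
  rdd.map(record => (batchTime.milliseconds, record))
}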
Ahhh, I get it, thanks!!
I did not know that we can use a "double index".
I used x[0] to point at shows, x[1][0] to point at channels and x[1][1] to
point at views.
I feel terribly noob.
Thank you all :)
Hi,
I'm working on effectively storing what is a session that receives its
close event, using Spark Streaming with updateStateByKey.
private static Function2
    COLLECTED_SESSION = (newItems, current) -> {
  SessionUpdate returnValue = current.orNull();
Hi,
I have a time-critical Spark application which takes sensor data from a
Kafka stream, stores it in a case class, applies transformations and then
stores it in a Cassandra schema. The data needs to be stored in the schema in
FIFO order. The order is maintained in the Kafka queue, but I am observing,
Although writing a custom UnaryTransformer is not difficult, writing a
non-UnaryTransformer is a little tricky (you have to check the source code).
And I can't find any documentation about how to write a custom Transformer in
an ML pipeline, even though writing a custom Transformer is a very basic
requirement. Is
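Since the docs don't cover it, here is a hedged sketch of a non-unary
Transformer against the Spark 1.5-era ml API; the column names are made up,
and the full Params plumbing is omitted for brevity:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Reads two input columns and appends their ratio as a new column.
class RatioTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("ratio"))

  override def transform(df: DataFrame): DataFrame =
    df.withColumn("ratio", col("numerator") / col("denominator"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("ratio", DoubleType, nullable = true))

  override def copy(extra: ParamMap): RatioTransformer = defaultCopy(extra)
}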
BTW, my spark.python.worker.reuse setting is set to "true".
Spark Streaming will consume and process data in parallel. So the order of
the output will depend not only on the order of the input but also on the
time it takes for each task to process. Different operations, like
repartitions, sorts and shuffles at the Spark level, will also affect ordering,
so the