If you want to tie them to other data, I think the best way is to use a
DataFrame join, provided they share an identity column.
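For instance, a minimal sketch, assuming both sides carry a shared "id"
column (the DataFrame and column names are illustrative):

// inner join the predictions back onto the other data by the shared id
val joined = predictions.join(otherData, Seq("id"))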
Thanks
Yanbo
2016-08-16 20:39 GMT-07:00 ayan guha :
> Hi
>
> Thank you for your reply. Yes, I can get predictions and original features
> together. My
Hi, does the version have to be the same down to the minor version? E.g.,
will 1.6.1 and 1.6.2 give this issue?
On Thu, Aug 18, 2016 at 6:27 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:
> your executors/driver must not have multiple versions of Spark on the
> classpath; it may come from the Cassandra connector. Check the pom
> dependencies of the version you fetched and whether it's compatible with
> your Spark version.
You're right, Jean:
it was a mismatched library in the default Oozie sharelib against my Spark version.
I've replaced it and it works normally now.
Thanks for your help, Jean.
On 18/08/16 13:37, Jean-Baptiste Onofré wrote:
It sounds like a mismatch between the Spark version shipped in Oozie and the
runtime one.
Regards
JB
@Michal
Yes, we have a public VectorUDT in the spark.mllib package at 1.6, and this class
still exists in 2.0.
And from 2.0, we provide a new VectorUDT in the spark.ml package and made it
private temporarily (it will be public in the near future).
Since 2.0, the spark.mllib package is in maintenance mode.
It sounds like a mismatch between the Spark version shipped in Oozie and the runtime
one.
Regards
JB
On Aug 18, 2016, 07:36, at 07:36, tkg_cangkul wrote:
>Hi Olivier, thanks for your reply.
>
>This is the full stack trace:
>
>Failing Oozie Launcher, Main class
>[org.apache.oozie.action.hadoop.SparkMain], main() threw exception,
>org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V
Hi Olivier, thanks for your reply.
This is the full stack trace:
Failing Oozie Launcher, Main class
[org.apache.oozie.action.hadoop.SparkMain], main() threw exception,
org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V
java.lang.NoSuchMethodError:
org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V
Agreed.
Regards
JB
On Aug 18, 2016, 07:32, at 07:32, Olivier Girardot
wrote:
>CC'ing dev list; you should open a Jira and a PR related to it to
>discuss it, cf.
>https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCodeChanges
>
>
>
>
>
>On Wed, Aug 17, 2016 4:01 PM, Andrés Ivaldi iaiva...@gmail.com wrote:
CC'ing dev list; you should open a Jira and a PR related to it to discuss it,
cf.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingCodeChanges
On Wed, Aug 17, 2016 4:01 PM, Andrés Ivaldi iaiva...@gmail.com wrote:
Hello, I'd like to report
that's another "pipeline" step to add, whereas persist is just
relevant during the lifetime of your jobs, and not in HDFS but on the local disk
of your executors.
On Wed, Aug 17, 2016 5:56 PM, neil90 neilp1...@icloud.com wrote:
>From the Spark
documentation (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)
This is not the full stack trace; please post the full stack trace if you want
some help.
On Wed, Aug 17, 2016 7:24 PM, tkg_cangkul yuza.ras...@gmail.com wrote:
Hi, I try to submit a Spark job with Oozie, but I've got one problem here.
When I submit the same job, sometimes my job succeeds but sometimes my job fails.
your executors/driver must not have multiple versions of Spark on the classpath;
it may come from the Cassandra connector. Check the pom dependencies of the
version you fetched and whether it's compatible with your Spark version.
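For example, a hedged sbt sketch of keeping the two aligned (the version
numbers are illustrative, not a recommendation):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.6.2" % "provided",
  // the connector's minor version line should match the Spark version you run
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)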
On Thu, Aug 18, 2016 6:05 AM, thbeh th...@thbeh.com wrote:
Running the query below I have been hitting a local class incompatible exception
Running the query below, I have been hitting a "local class incompatible"
exception; does anyone know the cause?
val rdd = csc.cassandraSql("""select *, concat('Q', d_qoy) as qoy from
store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on
ss_item_sk =
i_item_sk""").groupBy("i_category").piv
Can you please check the column order of all the datasets in the unionAll
operations? Are they in the same order?
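If they are not, a hedged sketch of realigning them first (1.x unionAll
resolves columns by position, not by name; names are illustrative):

import org.apache.spark.sql.functions.col
// reorder df2's columns to match df1's ordering before the union
val aligned = df2.select(df1.columns.map(col): _*)
val unioned = df1.unionAll(aligned)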
On 9 August 2016 at 02:47, max square wrote:
> Hey guys,
>
> I'm trying to save a DataFrame in CSV format after performing unionAll
> operations on it.
> But I get this exception -
>
> Exception in t
So here is the test case from the commit that added the first/last methods:
https://github.com/apache/spark/pull/10957/commits/defcc02a8885e884d5140b11705b764a51753162
+ test("last/first with ignoreNulls") {
+   val nullStr: String = null
+   val df = Seq(
+     ("a", 0, nullStr),
Wondering why you are creating separate DStreams? You should apply the
logic directly on the input DStream.
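For illustration, a hedged Scala sketch against the 0.8 direct-stream API
(the broker and topic names are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("input-topic"))
// one logical pipeline applied to the single input DStream
stream.map(_._2).filter(_.nonEmpty).print()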
On 18 Aug 2016 06:40, "vidhan" wrote:
> I have a *kafka* stream coming in with some input topic.
> This is the code I wrote for accepting the *kafka* stream.
>
> *>>> conf = SparkConf().setAppName(appname)
Thanks Harsh for the reply.
When I change the code to something like this -

def saveAsLatest(df: DataFrame, fileSystem: FileSystem, bakDir: String) = {
  fileSystem.rename(new Path(bakDir + latest),
    new Path(bakDir + "/" + ScalaUtil.currentDateTimeString))
  fileSystem.create(new Path(bak
I have a *kafka* stream coming in with some input topic.
This is the code I wrote for accepting the *kafka* stream.
*>>> conf = SparkConf().setAppName(appname)
>>> sc = SparkContext(conf=conf)
>>> ssc = StreamingContext(sc)
>>> kvs = KafkaUtils.createDirectStream(ssc, topics,\
{"metada
I'm using Spark 1.6.2 for Vector-based UDAF and this works:
def inputSchema: StructType = new StructType().add("input", new VectorUDT())
Maybe it was made private in 2.0
On 17 August 2016 at 05:31, Alexey Svyatkovskiy
wrote:
> Hi Yanbo,
>
> Thanks for your reply. I will keep an eye on that pull request.
The OneHotEncoder does *not* accept multiple columns.
You can use Michal's suggestion, where he uses a Pipeline to set the stages
and then executes them.
The other option is to write a function that performs one-hot encoding on a
column and returns a dataframe with the encoded column, then call it for each
column, as sketched below.
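A hedged sketch of that second option (the helper name is hypothetical, and
the input columns are assumed to already hold category indices):

import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.DataFrame

// OneHotEncoder is a Transformer, so it can be applied one column at a time
def encodeColumn(df: DataFrame, colName: String): DataFrame =
  new OneHotEncoder()
    .setInputCol(colName)
    .setOutputCol(colName + "Encoded")
    .transform(df)

val encoded = columns.foldLeft(df)(encodeColumn)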
Hi Ted/All,
I did the below to get the full stack trace; see below, not able to understand the root
cause...
except Exception as error:
    traceback.print_exc()
and this is what I get...
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/context.py",
line 580, in sql
return DataFrame(self._ssql_ctx.
I had already tried this way:
scala> val featureCols = Array("category","newone")
featureCols: Array[String] = Array(category, newone)
scala> val indexer = new
StringIndexer().setInputCol(featureCols).setOutputCol("categoryIndex").fit(df1)
<console>:29: error: type mismatch;
 found   : Array[String]
 required: String
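Since setInputCol takes a single String, a hedged per-column sketch (the
output column names are illustrative):

// one StringIndexer per column instead of one indexer over the array
val indexers = featureCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "Index").fit(df1)
}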
I don't think it does. From the documentation:
https://spark.apache.org/docs/2.0.0-preview/ml-features.html#onehotencoder,
I see that it still accepts one column at a time.
On Wed, Aug 17, 2016 at 10:18 AM, janardhan shetty
wrote:
> 2.0:
>
> One hot encoding currently accepts single input column
You can just map over your columns and create a pipeline:

val columns = Array("colA", "colB", "colC")
val transformers: Array[PipelineStage] = columns.map {
  x => new OneHotEncoder().setInputCol(x).setOutputCol(x + "Encoded")
}
val pipeline = new Pipeline()
  .setStages(transformers)
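A hedged completion, assuming the input columns already hold numeric
category indices:

val encodedDf = pipeline.fit(df).transform(df)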
On 17 Au
Hi,
I can see that the exception is caused by the following; can you check where in
your code you are using this path?
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
hdfs://testcluster:8020/experiments/vol/spark_chomp_data/bak/restaurants-bak/latest
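A hedged sketch of guarding that rename so a missing "latest" doesn't fail
the job (the path is taken from the trace above; names are illustrative):

val latestPath = new Path(bakDir + "/latest")
// only rotate the old data if it actually exists
if (fileSystem.exists(latestPath)) {
  fileSystem.rename(latestPath, new Path(bakDir + "/" + ScalaUtil.currentDateTimeString))
}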
On Wed, 17 Aug 2016
/bump
It'd be great if someone could point me in the right direction.
On Mon, Aug 8, 2016 at 5:07 PM, max square wrote:
> Here's the complete stacktrace -
> https://gist.github.com/rohann/649b0fcc9d5062ef792eddebf5a315c1
>
> For reference, here's the complete function -
>
> def saveAsLatest(
Hi, I try to submit a Spark job with Oozie, but I've got one problem here.
When I submit the same job, sometimes my job succeeds but sometimes my
job fails.
I've got this error message when the job failed:
java.lang.NoSuchMethodError:
org.apache.spark.launcher.CommandBuilderUtils.addPermGenSizeOpt(Ljava/util/List;)V
Spark version: 1.5.0
Record: 01-Jan-16
Expected result: 2016
I used the below code, which was shared in the user group:

from_unixtime(unix_timestamp($"Creation Date","dd-MMM-yy"),"yyyy")

Is this the right approach, or do we have any other approach?
NOTE:
I tried the *year()* function but it gives only null
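For what it's worth, a hedged sketch of an alternative (assuming $-notation
is in scope; year() alone returns null because the raw column is a string,
not a date):

import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp, year}
// parse "01-Jan-16" into a timestamp first, then extract the year as an int
val withYear = df.withColumn("year",
  year(from_unixtime(unix_timestamp($"Creation Date", "dd-MMM-yy"))))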
2.0:
One-hot encoding currently accepts a single input column; is there a way to
include multiple columns?
Hi folks,
Just a friendly message that we have added Python support to the REST
Spark Job Server project. If you are a Python user looking for a
RESTful way to manage your Spark jobs, please come have a look at our
project!
https://github.com/spark-jobserver/spark-jobserver
-Evan
---
I experienced very slow execution time:
http://stackoverflow.com/questions/38803546/spark-r-2-0-dapply-very-slow
and I am wondering why...
On Wed, Aug 17, 2016 at 1:12 PM, Felix Cheung
wrote:
> This is supported in Spark 2.0.0 as dapply and gapply. Please see the API
> doc:
> https://spark.apache.or
My code is very simple; if I use other Hive tables, my code works fine. This
particular table (a virtual view) is huge and might have more metadata.
It has only two columns.
The virtual view name is: cluster_table

# col_name    data_type
ln            string
parti         in
Please include user@ in your reply.
Can you reveal the snippet of Hive SQL?
On Wed, Aug 17, 2016 at 9:04 AM, vr spark wrote:
> spark 1.6.1
> mesos
> the job runs for like 10-15 minutes, gives this message, and I killed
> it.
>
> In this job, I am creating a data frame from a Hive SQL query. There
Hi,
I am using Spark 2.0 and I am getting unexpected results using the last()
method. Has anyone else experienced this? I get the sense that last() is
working correctly within a given data partition but not across the entire RDD.
first() seems to work as expected, so I can work around this by ha
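For comparison, a hedged sketch that pins the ordering down explicitly with a
window (assuming a timestamp column "ts" to order by; names are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// an unbounded frame over an explicit ordering makes last() deterministic
// (note: orderBy without partitionBy pulls the data into a single partition)
val w = Window.orderBy("ts").rowsBetween(Long.MinValue, Long.MaxValue)
val result = df.withColumn("lastValue", last("value").over(w))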
From the Spark
documentation (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence),
yes, you can use persist on a DataFrame instead of cache. All cache is, is
shorthand for the default persist storage level "MEMORY_ONLY". If you want
to persist the dataframe to disk, you should use a storage level that includes disk.
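For example, a minimal sketch (the storage level shown is one option among
several):

import org.apache.spark.storage.StorageLevel
// spills partitions that don't fit in memory to executor-local disk
df.persist(StorageLevel.MEMORY_AND_DISK)
// df.cache() is equivalent to df.persist(StorageLevel.MEMORY_ONLY)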
W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an unknown
offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492
W0816 23:17:01.984987 16360 sched.cpp:1195] Attempting to accept an unknown
offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910493
W0816 23:17:01.985124 16360 sched.cpp
Can you provide more information ?
Were you running on YARN ?
Which version of Spark are you using ?
Was your job failing ?
Thanks
On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote:
>
> W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492
Can you show the complete stack trace ?
Which version of Spark are you using ?
Thanks
On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote:
> Hi,
> I am getting error on below scenario. Please suggest.
>
> i have a virtual view in hive
>
> view name log_data
> it has 2 columns
>
> query_map
Hi,
I am getting an error in the below scenario. Please suggest.
I have a virtual view in Hive:
view name: log_data
it has 2 columns:
query_map map
parti_date int
Here is my snippet for the Spark data frame:
res = sqlcont.sql("select parti_date FROM log_data WHERE par
Hello, I'd like to report wrong behavior of the Dataset API; I don't know
how I can do that, as my Jira account doesn't allow me to add an issue.
I'm using Apache Spark 2.0.0, but the problem has existed since at least version 1.4
(per the doc, since 1.3).
The problem is simple to reproduce, as is the workaround,
This is supported in Spark 2.0.0 as dapply and gapply. Please see the API doc:
https://spark.apache.org/docs/2.0.0/api/R/
Feedback welcome and appreciated!
_________________________________
From: Yogesh Vyas <informy...@gmail.com>
Sent: Tuesday, August 16, 2016 11:39 PM
Subject: UDF in SparkR
Hi,
First, I am not sure if I should inherit from InputDStream or
ReceiverInputDStream. For ReceiverInputDStream, why would I want to run a
receiver on each worker node?
If I want to inherit InputDStream, what should I do in the compute() method?
--
Thanks,
David S.
Hi,
for small to medium size clusters I think Spark Standalone mode is a good
choice.
We are contemplating moving to YARN as our cluster grows.
What are the pros and cons of using either, please? Which one offers the best
Thanking you
Hello guys: I have a problem loading a recommendation model. I have 2 models;
one is good (able to get recommendation results) and the other is not working. I checked
these 2 models; both are MatrixFactorizationModel objects. But in the metadata,
one is a PipelineModel and the other is a MatrixFactorizationModel.