withColumn on nested schema

2018-06-13 Thread Zsolt Tóth
Hi, I'm trying to replace values in a nested column of a JSON-based dataframe using withColumn(). This syntax works for select, filter, etc., giving only the nested "country" column: df.select('body.payload.country') but if I do this, it will create a new column with the name
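A minimal sketch of the usual workaround, assuming a hypothetical schema body.payload.{country, amount}: withColumn("body.payload.country", ...) creates a flat column literally named "body.payload.country", so the enclosing struct has to be rebuilt field by field instead.

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.DataFrame;

    // Rebuild "body" with the nested value replaced; every field that should
    // survive must be re-selected explicitly, or it is dropped.
    DataFrame fixed = df.withColumn("body",
        struct(
            struct(
                upper(col("body.payload.country")).alias("country"), // replaced value
                col("body.payload.amount").alias("amount")           // copied unchanged
            ).alias("payload")));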

Map and MapPartitions with partition-local variable

2016-11-17 Thread Zsolt Tóth
Any comment on this one? On Nov 16, 2016 at 12:59 PM, "Zsolt Tóth" <toth.zsolt@gmail.com> wrote: > Hi, > > I need to run a map() and a mapPartitions() on my input DF. As a > side-effect of the map(), a partition-local variable should be updated, > th

Map and MapPartitions with partition-local variable

2016-11-16 Thread Zsolt Tóth
Hi, I need to run a map() and a mapPartitions() on my input DF. As a side-effect of the map(), a partition-local variable should be updated, which is then used in the mapPartitions() afterwards. I can't use a Broadcast variable, because it's shared between partitions on the same executor. Where can I
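One workaround, sketched under the assumption that the variable only has to survive between the two passes within a task: fuse the map() into the mapPartitions() so the state stays local to a single closure. Record, Result, transform() and summarize() are hypothetical, and the Spark 2.x Java API (FlatMapFunction returning an Iterator) is assumed.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<Result> out = input.mapPartitions(rows -> {
        long localCount = 0;                      // the partition-local variable
        List<Record> mapped = new ArrayList<>();
        while (rows.hasNext()) {
            Record r = transform(rows.next());    // the former map() step
            localCount += r.weight();             // side effect, now task-local
            mapped.add(r);
        }
        // the former mapPartitions() step, with the local state in scope;
        // summarize() is assumed to return a List<Result>
        return summarize(mapped, localCount).iterator();
    });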

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Zsolt Tóth
based on the renew-interval instead of the max-lifetime? 2016-11-04 2:37 GMT+01:00 Marcelo Vanzin <van...@cloudera.com>: > On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth <toth.zsolt@gmail.com> > wrote: > > What is the purpose of the delegation token renewal (the one that i

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
extend its lifetime. The feature you're talking about is for > creating *new* delegation tokens after the old ones expire and cannot > be renewed anymore (i.e. the max-lifetime configuration). > > On Thu, Nov 3, 2016 at 2:02 PM, Zsolt Tóth <toth.zso

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
definitely exists and people definitely have run into it. So > if you're not hitting it, it's most definitely an issue with your test > configuration. > > On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth <toth.zsolt@gmail.com> > wrote: > > Hi, > > > > I ran some t

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Any ideas about this one? Am I missing something here? 2016-11-03 15:22 GMT+01:00 Zsolt Tóth <toth.zsolt@gmail.com>: > Hi, > > I ran some tests regarding Spark's Delegation Token renewal mechanism. As > I see, the concept here is simple: if I give my keytab file and

Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Hi, I ran some tests regarding Spark's Delegation Token renewal mechanism. As I see it, the concept here is simple: if I give my keytab file and client principal to Spark, it starts a token renewal thread and renews the namenode delegation tokens after some time. This works fine. Then I tried to
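For reference, a minimal sketch of handing the keytab and principal to Spark for a programmatic yarn-cluster submission; the path and principal are placeholders, and the same can be done with the spark-submit flags --keytab and --principal.

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.yarn.keytab", "/etc/security/keytabs/app.keytab")
        .set("spark.yarn.principal", "app@EXAMPLE.COM");
    // With these set, Spark logs in from the keytab and keeps obtaining fresh
    // delegation tokens for long-running applications.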

Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) Regards, Zsolt 2016-02-12 13:11 GMT+01:00 Ted Yu <yuzhih...@gmail.com>: > Can you pastebin the full error with all column types ? > > There should be a difference between some column(s). > > Cheers > > > On Feb 11, 2016, at 2

Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
; > outputDF = unlabelledDF.join(predictedDF.select(“id”,”predicted”),”id”) > > On 11 February 2016 at 10:12, Zsolt Tóth <toth.zsolt@gmail.com> wrote: > >> Hi, >> >> I'd like to append a column of a dataframe to another DF (using Spark >> 1.5.2): >> >

Inserting column to DataFrame

2016-02-11 Thread Zsolt Tóth
Hi, I'd like to append a column of a dataframe to another DF (using Spark 1.5.2): DataFrame outputDF = unlabelledDF.withColumn("predicted_label", predictedDF.col("predicted")); I get the following exception: java.lang.IllegalArgumentException: requirement failed: DataFrame must have the same
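The replies above suggest the usual workaround; spelled out as a sketch, assuming both frames share an "id" key column:

    // withColumn() only accepts a Column of the same DataFrame, so join the
    // predictions back on a shared key instead.
    DataFrame outputDF = unlabelledDF.join(
        predictedDF.select("id", "predicted"), "id");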

create DataFrame from RDD

2015-12-02 Thread Zsolt Tóth
Hi, I have a Spark job with many transformations (a sequence of maps and mapPartitions) and only one action at the end (DataFrame.write()). The transformations return an RDD, so I need to create a DataFrame. To be able to use sqlContext.createDataFrame() I need to know the schema of the Row, but for
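A minimal sketch of building the schema by hand when it can't be inferred; the field names, types and element accessors are placeholders.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.*;

    StructType schema = new StructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.LongType, false),
        DataTypes.createStructField("label", DataTypes.DoubleType, true)
    });
    // rdd is the output of the transformation chain; map each element to a Row.
    JavaRDD<Row> rows = rdd.map(v -> RowFactory.create(v.getId(), v.getLabel()));
    DataFrame df = sqlContext.createDataFrame(rows, schema);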

Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Zsolt Tóth
Hi, this is exactly the same as my issue; it seems to be a bug in 1.5.x (see my thread for details). 2015-11-19 11:20 GMT+01:00 Jeff Zhang : > Seems your jdbc url is not correct. Should be jdbc:mysql:// > 192.168.41.229:3306 > > On Thu, Nov 19, 2015 at 6:03 PM,

ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi, I'm trying to throw an exception of my own exception class (MyException extends SparkException) on one of the executors. This works fine on Spark 1.3.x and 1.4.x, but throws a deserialization/ClassNotFound exception on Spark 1.5.x. This happens only when I throw it on an executor; on the driver it
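A minimal reproduction sketch; MyException is compiled into the application jar submitted via spark-submit.

    // In the application jar:
    public class MyException extends org.apache.spark.SparkException {
        public MyException(String message) { super(message); }
    }

    // In the job: throwing on the driver works, but throwing inside a task
    // hits the deserialization/ClassNotFound problem on 1.5.x when the
    // failure is reported back to the driver.
    rdd.foreach(x -> { throw new MyException("thrown on an executor"); });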

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi Tamás, the exception class is in the application jar, and I'm using the spark-submit script. 2015-11-19 11:54 GMT+01:00 Tamas Szuromi <tamas.szur...@odigeo.com>: > Hi Zsolt, > > How do you load the jar and how do you prepend it to the classpath? > > Tamas

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zsolt Tóth
Hi, I ran your example on Spark-1.4.1 and 1.5.0-rc3. It succeeds on 1.4.1 but throws the OOM on 1.5.0. Do any of you know which PR introduced this issue? Zsolt 2015-09-07 16:33 GMT+02:00 Zoltán Zvara : > Hey, I'd try to debug, profile ResolvedDataSource. As far as I

Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Zsolt Tóth
Hi all, it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on the mirror sites. Am I missing something? Regards, Zsolt

Re: RDD collect hangs on large input data

2015-04-17 Thread Zsolt Tóth
. On Wed, Apr 8, 2015 at 3:45 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause the issue? Did you test it with Java 8?

Re: RDD collect hangs on large input data

2015-04-08 Thread Zsolt Tóth
I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause the issue? Did you test it with Java 8?

Re: Using ORC input for mllib algorithms

2015-03-30 Thread Zsolt Tóth
via the SQL data source API: https://github.com/apache/spark/pull/3753. You can try pulling that PR and help test it. -Xiangrui On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class
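For reference, the data source API route that PR led to (Spark 1.4+, through HiveContext) reads ORC straight into a DataFrame; a sketch, with the path as a placeholder:

    DataFrame orc = hiveContext.read().format("orc").load("/data/input.orc");
    // MLlib input can then be built from typed Rows instead of raw OrcStruct.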

Re: RDD collect hangs on large input data

2015-03-30 Thread Zsolt Tóth
huge, you can simply do a count() to trigger the execution. Can you paste your exception stack trace so that we'll know what's happening? Thanks Best Regards On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I have a simple Spark application: it creates

RDD collect hangs on large input data

2015-03-27 Thread Zsolt Tóth
Hi, I have a simple Spark application: it creates an input rdd with sc.textFile, and it calls flatMapToPair, reduceByKey and map on it. The output rdd is small, a few MBs. Then I call collect() on the output. If the text file is ~50GB, it finishes in a few minutes. However, if it's larger
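One way to narrow this down: trigger the job without the full driver-side fetch, as the reply above suggests with count(); take() and toLocalIterator() are further bounded alternatives. Result and output are placeholders.

    import java.util.Iterator;
    import java.util.List;

    long n = output.count();                        // runs the job, no big fetch
    List<Result> head = output.take(100);           // bounded sample for inspection
    Iterator<Result> it = output.toLocalIterator(); // streams one partition at a time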

Using ORC input for mllib algorithms

2015-03-25 Thread Zsolt Tóth
Hi, I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class) to read data in ORC format as an RDD. I did some benchmarking of ORC input vs. text input for MLlib and ran into a few issues with ORC. Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor

Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-06 Thread Zsolt Tóth
Hi, I submit Spark jobs in yarn-cluster mode remotely from Java code by calling Client.submitApplication(). For some reason I want to use 1.3.0 jars on the client side (e.g. spark-yarn_2.10-1.3.0.jar) but I have spark-assembly-1.2.1* on the cluster. The problem is that the ApplicationMaster can't

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
One more question: is there a reason why Spark throws an error when too much memory is requested, instead of capping it at the maximum value (as YARN would do by default)? Thanks! 2015-02-10 17:32 GMT+01:00 Zsolt Tóth toth.zsolt@gmail.com: Hi, I'm using Spark in yarn-cluster mode and submit

Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
Hi, I'm using Spark in yarn-cluster mode and submit the jobs programmatically from the client in Java. I ran into a few issues when I tried to set the resource allocation properties. 1. It looks like setting spark.executor.memory, spark.executor.cores and spark.executor.instances has no effect
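A sketch of where these properties are normally expected for a programmatic yarn-cluster submission: on the SparkConf the YARN Client is built from, before submitApplication() is called; the values here are placeholders.

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.executor.instances", "10");
    // Build the yarn Client from this conf, then call submitApplication();
    // setting these inside the application comes too late to size the containers.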

[mllib] Decision Tree - prediction probabilites of label classes

2015-01-21 Thread Zsolt Tóth
Hi, I use DecisionTree for multi-class classification. I can get the probability of the predicted label for every node in the decision tree from node.predict().prob(). Is it possible to retrieve or compute the probability of every possible label class in the node? To be more clear: say in Node A
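As far as I can tell, mllib's Node only exposes the top class and its probability through Predict, not the full distribution. For comparison, a sketch of the later spark.ml DecisionTreeClassifier (Spark 1.4+), which does emit a per-class probability vector; column names and the train/test frames are placeholders.

    import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
    import org.apache.spark.ml.classification.DecisionTreeClassifier;

    DecisionTreeClassifier dt = new DecisionTreeClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features");
    DecisionTreeClassificationModel model = dt.fit(train);
    // "probability" holds a vector with one entry per label class.
    model.transform(test).select("probability", "prediction").show();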