Re: [External Sender] Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-06-18 Thread Olivier Girardot
I am also facing the same issue on my kubernetes cluster (v1.11.5) on AWS with spark version 2.3.3, any luck in figuring out the root cause? > On Fri, May 3, 2019 at 5:37 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: >> Hi,

Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-05-03 Thread Olivier Girardot
ed on other vendors? Also on the kubelet nodes did you notice any pressure on the DNS side? Li > On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <o.girar...@lateral-thoughts.com> wrote: >> Hi everyone, >> I have ~300 spark jobs on Kubernetes (GKE)

Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-04-29 Thread Olivier Girardot
point.sh used in the kubernetes packaging) - We can add a simple step to the init container trying to do the DNS resolution and failing after 60s if it did not work. But these steps won't change the fact that the driver will stay stuck thinking we're still in the case of the

Back to SQL

2018-10-03 Thread Olivier Girardot
Hi everyone, Is there any known way to go from a Spark SQL Logical Plan (optimised?) back to a SQL query? Regards, Olivier.

Spark Structured Streaming and compacted topic in Kafka

2017-09-06 Thread Olivier Girardot
Hi everyone, I'm aware of the issue regarding direct stream 0.10 consumer in spark and compacted topics (c.f. https://issues.apache.org/jira/browse/SPARK-17147). Is there any chance that spark structured-streaming kafka is compatible with compacted topics ? Regards, -- *Olivier Girardot*

Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Olivier Girardot
JIRA or is there a workaround ? Regards, -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com

Re: Pyspark 2.1.0 weird behavior with repartition

2017-03-11 Thread Olivier Girardot
4, in loads return s.decode("utf-8") if self.use_unicode else s File "/home/snuderl/scrappstore/virtualenv/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8'

Re: Nested ifs in sparksql

2017-01-10 Thread Olivier Girardot
d 41 levels of nested if else in spark sql. I have programmed it using APIs on dataframe. But it takes too much time. Is there anything I can do to improve on time here? Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
PM, Michael Gummelt mgumm...@mesosphere.io wrote: What do you mean your driver has all the dependencies packaged?  What are "all the dependencies"?  Is the distribution you use to launch your driver built with -Pmesos? On Tue, Jan 10, 2017 at 12:18 PM, Olivier Girardot < o.girar...@lateral-thoughts

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
n the final dist of my app…So everything should work in theory. On Tue, Jan 10, 2017 7:22 PM, Michael Gummelt mgumm...@mesosphere.io wrote: Just build with -Pmesos http://spark.apache.org/docs/latest/building-spark.html#building-with-mesos-support On Tue, Jan 10, 2017 at 8:56 AM, Olivier Gir

Re: Could not parse Master URL for Mesos on Spark 2.1.0

2017-01-10 Thread Olivier Girardot
ing since the same configuration but using a Spark 2.0.0 is running fine within Vagrant. Could someone please help? Thanks in advance, Richard -- Abhishek J Bhandari Mobile No. +1 510 493 6205 (USA) Mobile No. +91 96387 93021 (IND) R & D Department Valent Software Inc. CA Email: abhis...@val

Re: Spark SQL - Applying transformation on a struct inside an array

2017-01-05 Thread Olivier Girardot
by computations, but that's bound to be inefficient * or to generate bytecode using the schema to do the nested "getRow, getSeq…" and re-create the rows once the transformation is applied. I'd like to open an issue regarding that use case because it's not the first or last tim

Re: Help in generating unique Id in spark row

2017-01-05 Thread Olivier Girardot
+|7d33a516-5532-410...|                null||                null|2439d6db-16a2-44b...| +++ -- Thanks and Regards, Saurav Sinha Contact: 9742879062 -- Thanks and Regards, Saurav Sinha Contact: 9742879062 Olivier Girardot| Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Olivier Girardot
, because the Hadoop APIs that are used are all the same across these versions. That would be the thing that makes you need multiple versions of the artifact under multiple classifiers. On Wed, Sep 28, 2016 at 1:16 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: ok, don't

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Olivier Girardot
s there by any chance publications of Spark 2.0.0 with different classifier according to different versions of Hadoop available ? Thanks for your time ! Olivier Girardot Olivier Girardot| Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-21 Thread Olivier Girardot
according to different versions of Hadoop available ? Thanks for your time ! Olivier Girardot

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-16 Thread Olivier Girardot
AM, Michael Armbrust mich...@databricks.com wrote: Is what you are looking for a withColumn that support in place modification of nested columns? or is it some other problem? On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: I tried to use the RowEnco

Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-14 Thread Olivier Girardot
the same issue. It's pretty common in data cleaning applications for data in the early stages to have nested lists or sets inconsistent or incomplete schema information. Fred On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: Hi everyone,I'

Spark SQL - Applying transformation on a struct inside an array

2016-09-13 Thread Olivier Girardot
ly a transformation on complex nested datatypes (arrays and struct) on a Dataframe updating the value itself. Regards, Olivier Girardot

Re: Aggregations with scala pairs

2016-08-17 Thread Olivier Girardot
aggExprs).map { pairExpr => strToExpr(pairExpr._2)(df(pairExpr._1).expr) }.toSeq) } regards -- Ing. Ivaldi Andres Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Spark DF CacheTable method. Will it save data to disk?

2016-08-17 Thread Olivier Girardot
-method-Will-it-save-data-to-disk-tp27533p27551.html Sent from the Apache Spark User List mailing list archive at Nabble.com. Olivier Girardot | Associé o.g

Re: error when running spark from oozie launcher

2016-08-17 Thread Olivier Girardot
ions properties but this did not help enough for me. Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Spark SQL 1.6.1 issue

2016-08-17 Thread Olivier Girardot
-tp27554.html Sent from the Apache Spark User List mailing list archive at Nabble.com. Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: tpcds for spark2.0

2016-07-29 Thread Olivier Girardot
gn instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org $apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD Olivier Girardot | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: OOM exception during Broadcast

2016-03-08 Thread Olivier Girardot
bjectInputStream.java:1997) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) [the same four frames repeat several more times] I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark property maximizeResourceAllocation is set to true (executor.memory = 48G according to spark ui environment). We're also using kryo serialization and Yarn is the resource manager. Any ideas as to what might be going wrong and how to debug this? Thanks, Arash -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Olivier Girardot
the executor will have large data (~2M RDD objects) to process. I used following as executorJavaOpts but it doesn't seem to work. -XX:-HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -3 %p' -XX:HeapDumpPath=/opt/cores/spark -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Spark Certification

2016-02-14 Thread Olivier Girardot
om/ >>> http://sparkdeveloper.com/ >>> >>> >>> From: naga sharathrayapati >>> Date: Wednesday, February 10, 2016 at 11:36 PM >>> To: "user@spark.apache.org" >>> Subject: Spark Certification >>> >>> Hello Al

Re: Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
, Olivier. 2016-01-05 19:01 GMT+01:00 Michael Armbrust : > You could try with the `Encoders.bean` method. It detects classes that > have getters and setters. Please report back! > > On Tue, Jan 5, 2016 at 9:45 AM, Olivier Girardot < > o.girar...@lateral-thoughts.com> wrote:

Spark 1.6 - Datasets and Avro Encoders

2016-01-05 Thread Olivier Girardot
Hi everyone, considering the new Datasets API, will there be Encoders defined for reading and writing Avro files ? Will it be possible to use already generated Avro classes ? Regards, -- *Olivier Girardot*

Re: Lookup / Access of master data in spark streaming

2015-10-05 Thread Olivier Girardot
your time Regards, 2015-10-05 23:49 GMT+02:00 Tathagata Das : > Yes, when old broacast objects are not referenced any more in the driver, > then associated data in the driver AND the executors will get cleared. > > On Mon, Oct 5, 2015 at 1:40 PM, Olivier Girardot < > o.girar...

Re: Lookup / Access of master data in spark streaming

2015-10-05 Thread Olivier Girardot
-- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-08-31 Thread Olivier Girardot
n$8$$anon$1.next(Window.scala:252) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 2015-08-26 11:47 GMT+02:00 Olivier Girardot : > Hi everyone, > I know this "post title" doesn't seem very logical and I agree, > we have a very complex computation usin

Re: Spark stages very slow to complete

2015-08-25 Thread Olivier Girardot
>>> any advice for me? >>> Thanks! -- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94

Re: Classifier for Big Data Mining

2015-07-21 Thread Olivier Girardot
depends on your data and I guess the time/performance goals you have for both training/prediction, but for a quick answer : yes :) 2015-07-21 11:22 GMT+02:00 Chintan Bhatt : > Which classifier can be useful for mining massive datasets in spark? > Decision Tree can be good choice as per scalabilit

Re: Check for null in PySpark DataFrame

2015-07-01 Thread Olivier Girardot
I must admit I've been using the same "back to SQL" strategy for now :p So I'd be glad to have insights into that too. Le mar. 30 juin 2015 à 23:28, pedro a écrit : > I am trying to find what is the correct way to programmatically check for > null values for rows in a dataframe. For example, bel

Re: coalesce on dataFrame

2015-07-01 Thread Olivier Girardot
PySpark or Spark (scala) ? When you use coalesce with anything but a column you must use a literal like that in PySpark : from pyspark.sql import functions as F F.coalesce(df.a, F.lit(True)) Le mer. 1 juil. 2015 à 12:03, Ewan Leith a écrit : > It's in spark 1.4.0, or should be at least: > > ht

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Olivier Girardot
I would pretty much need exactly this kind of feature too Le ven. 26 juin 2015 à 21:17, Dave Ariens a écrit : > Hi Timothy, > > > > Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I > need to ensure that my tasks running on the slaves perform a Kerberos login > before acc

Re: Spark Shell Hive Context and Kerberos ticket

2015-06-25 Thread Olivier Girardot
Nop I have not but I'm glad I'm not the only one :p Le ven. 26 juin 2015 07:54, Tao Li a écrit : > Hi Olivier, have you fix this problem now? I still have this fasterxml > NoSuchMethodError. > > 2015-06-18 3:08 GMT+08:00 Olivier Girardot < > o.girar...@lateral-tho

Re: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-06-22 Thread Olivier Girardot
Hi, I can't get this to work using CDH 5.4, Spark 1.4.0 in yarn cluster mode. @andrew did you manage to get it work with the latest version ? Le mar. 21 avr. 2015 à 00:02, Andrew Lee a écrit : > Hi Marcelo, > > Exactly what I need to track, thanks for the JIRA pointer. > > > > Date: Mon, 20 Apr

Re: Spark Shell Hive Context and Kerberos ticket

2015-06-17 Thread Olivier Girardot
gards, Olivier. Le mer. 17 juin 2015 à 11:37, Olivier Girardot < o.girar...@lateral-thoughts.com> a écrit : > Hi everyone, > After copying the hive-site.xml from a CDH5 cluster, I can't seem to > connect to the hive metastore using spark-shell, here's a part of the stack &

Spark Shell Hive Context and Kerberos ticket

2015-06-17 Thread Olivier Girardot
Hi everyone, After copying the hive-site.xml from a CDH5 cluster, I can't seem to connect to the hive metastore using spark-shell, here's a part of the stack trace I get : 15/06/17 04:41:57 ERROR TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Cause

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Olivier Girardot
You can use it as a broadcast variable, but if it's "too" large (more than 1Gb I guess), you may need to share it by joining it, using some kind of key, with the other RDDs. But this is the kind of thing broadcast variables were designed for. Regards, Olivier. Le jeu. 4 juin 2015 à 23:50, dgoldenberg
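A minimal sketch of the broadcast-variable approach, assuming a spark-shell session where sc is already defined; the dictionary contents and names are purely illustrative:

    // A stand-in for the large lookup resource discussed above.
    val dictionary: Map[String, String] = Map("ERR1" -> "disk failure", "ERR2" -> "timeout")

    // Ship one read-only copy to each executor instead of one copy per task.
    val dictBc = sc.broadcast(dictionary)

    val codes = sc.parallelize(Seq("ERR1", "ERR2", "ERR3"))

    // Look codes up against the broadcast copy inside the transformation.
    val labelled = codes.map(code => (code, dictBc.value.getOrElse(code, "unknown")))
    labelled.collect().foreach(println)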

Re: Best strategy for Pandas -> Spark

2015-06-02 Thread Olivier Girardot
Thanks for the answer, I'm currently doing exactly that. I'll try to sum-up the usual Pandas <=> Spark Dataframe caveats soon. Regards, Olivier. Le mar. 2 juin 2015 à 02:38, Davies Liu a écrit : > The second one sounds reasonable, I think. > > On Thu, Apr 30, 2015 at

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
e classes in hiveUdfs.scala which expose hiveUdaf's as Spark > SQL AggregateExpressions, but they are private. > > On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot < > o.girar...@lateral-thoughts.com> wrote: > >> I've finally come to the same conclusion, but isn'

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
ey: Int, value: String) > val df=sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString)).toDF > df.registerTempTable("table") > sqlContext.sql("select percentile(key,0.5) from table").show() > > ​ > > On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <

Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Hi everyone, Is there any way to compute a median on a column using Spark's Dataframe. I know you can use stats in a RDD but I'd rather stay within a dataframe. Hive seems to imply that using ntile one can compute percentiles, quartiles and therefore a median. Does anyone have experience with this

Re: RandomSplit with Spark-ML and Dataframe

2015-05-19 Thread Olivier Girardot
/github.com/apache/spark/blob/master/python/pyspark/ml/tuning.py#L214 > -Xiangrui > On Thu, May 7, 2015 at 8:39 AM, Olivier Girardot > wrote: > > Hi, >
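For reference, a small sketch of a train/validation split done directly on a DataFrame, assuming Spark 1.4+ where DataFrame.randomSplit is available; the toy data is illustrative:

    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000).map(i => (i.toDouble, i % 2)).toDF("feature", "label")

    // 80/20 split with a fixed seed so the split is reproducible across runs.
    val Array(training, validation) = df.randomSplit(Array(0.8, 0.2), 42L)
    println(s"train=${training.count()} validation=${validation.count()}")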

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-18 Thread Olivier Girardot
PR is opened : https://github.com/apache/spark/pull/6237 Le ven. 15 mai 2015 à 17:55, Olivier Girardot a écrit : > yes, please do and send me the link. > @rxin I have trouble building master, but the code is done... > > > Le ven. 15 mai 2015 à 01:27, Haopu Wang a écrit :

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-15 Thread Olivier Girardot
yes, please do and send me the link. @rxin I have trouble building master, but the code is done... Le ven. 15 mai 2015 à 01:27, Haopu Wang a écrit : > Thank you, should I open a JIRA for this issue? > > > -- > > *From:* Olivier Girardot [mailto

Re: Why so slow

2015-05-12 Thread Olivier Girardot
can you post the explain too ? Le mar. 12 mai 2015 à 12:11, Jianshi Huang a écrit : > Hi, > > I have a SQL query on tables containing big Map columns (thousands of > keys). I found it to be very slow. > > select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as > avg > from test

Re: value toDF is not a member of RDD object

2015-05-12 Thread Olivier Girardot
"` to `build.sbt` but the error remains. Do I need to import modules > other than `import org.apache.spark.sql.{ Row, SQLContext }`? > > On Tue, May 12, 2015 at 5:56 PM Olivier Girardot > wrote: > >> toDF is part of spark SQL so you need Spark SQL dependency

Re: value toDF is not a member of RDD object

2015-05-12 Thread Olivier Girardot
toDF is part of spark SQL so you need Spark SQL dependency + import sqlContext.implicits._ to get the toDF method. Regards, Olivier. Le mar. 12 mai 2015 à 11:36, SLiZn Liu a écrit : > Hi User Group, > > I’m trying to reproduce the example on Spark SQL Programming Guide >
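A minimal sketch of the fix described above (spark-sql on the classpath plus the implicits import), assuming a spark-shell-like session where sc exists; the case class and data are illustrative:

    // build.sbt needs the Spark SQL module, e.g. (version illustrative):
    //   libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // brings toDF into scope for RDDs of case classes

    val df = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 45))).toDF()
    df.printSchema()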

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Olivier Girardot
ullable. > > On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot > wrote: > >> Hi Haopu, >> actually here `key` is nullable because this is your input's schema : >> >> scala> result.printSchema >> root >> |-- key: string (nullable = true) >

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Olivier Girardot
Hi Haopu, actually here `key` is nullable because this is your input's schema : scala> result.printSchema root |-- key: string (nullable = true) |-- SUM(value): long (nullable = true) scala> df.printSchema root |-- key: string (nullable = true) |-- value: long (nullable = false) I tried it with

RandomSplit with Spark-ML and Dataframe

2015-05-07 Thread Olivier Girardot
Hi, is there any best practice to do like in MLLib a randomSplit of training/cross-validation set with dataframes and the pipeline API ? Regards Olivier.

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Olivier Girardot
"hdfs://some ip:8029/dataset/*/*.parquet" doesn't work for you ? Le jeu. 7 mai 2015 à 03:32, vasuki a écrit : > Spark 1.3.1 - > i have a parquet file on hdfs partitioned by some string looking like this > /dataset/city=London/data.parquet > /dataset/city=NewYork/data.parquet > /dataset/city=Pari

Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Olivier Girardot
>> but this issue still happens. >> Thanks & Best regards! >> 罗辉 San.Luo >> - Original message - >> From: Olivier Girardot >> To: luohui20...@sina.com, user >> Subject: Re: sparksql ru

Re: AJAX with Apache Spark

2015-05-04 Thread Olivier Girardot
Hi Sergio, you shouldn't architect it this way; rather, update a store with Spark Streaming that your Play App will query. For example a Cassandra table, or Redis, or anything that will be able to answer you in milliseconds, rather than "querying" the Spark Streaming program. Regards, Olivier
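A rough sketch of that architecture, assuming an existing DStream of results; StoreClient is a hypothetical placeholder for whatever store client (Cassandra, Redis, ...) the Play application will query, not a real driver:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical store client standing in for a Cassandra/Redis driver.
    class StoreClient(endpoint: String) {
      def put(key: String, value: Long): Unit = println(s"$endpoint <- $key=$value")
      def close(): Unit = ()
    }

    def persist(results: DStream[(String, Long)]): Unit =
      results.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
          // One connection per partition, not per record (the usual output pattern);
          // the Play application then reads from the store, never from the streaming job.
          val store = new StoreClient("store-host:9042")
          partition.foreach { case (key, value) => store.put(key, value) }
          store.close()
        }
      }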

Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Olivier Girardot
Hi, What is you Spark version ? Regards, Olivier. Le lun. 4 mai 2015 à 11:03, a écrit : > hi guys > > when i am running a sql like "select a.name,a.startpoint,a.endpoint, > a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > > a.startpoint + 25);" I found sparks

Re: Drop a column from the DataFrame.

2015-05-03 Thread Olivier Girardot
great thx Le sam. 2 mai 2015 à 23:58, Ted Yu a écrit : > This is coming in 1.4.0 > https://issues.apache.org/jira/browse/SPARK-7280 > > > > On May 2, 2015, at 2:27 PM, Olivier Girardot wrote: > > Sounds like a patch for a "drop" method... > > Le sam

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-02 Thread Olivier Girardot
Can you post your code, otherwise there's not much we can do. Regards, Olivier. Le sam. 2 mai 2015 à 21:15, shahab a écrit : > Hi, > > I am using sprak-1.2.0 and I used Kryo serialization but I get the > following excepton. > > java.io.IOException: com.esotericsoftware.kryo.KryoException: > ja

Re: Drop a column from the DataFrame.

2015-05-02 Thread Olivier Girardot
Sounds like a patch for a "drop" method... Le sam. 2 mai 2015 à 21:03, dsgriffin a écrit : > Just use select() to create a new DataFrame with only the columns you want. > Sort of the opposite of what you want -- but you can select all but the > columns you want minus the one you don. You could e
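A short sketch of that select()-based workaround, assuming a spark-shell session with sqlContext; column names are illustrative:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.col

    val df = sc.parallelize(Seq((1, "a", true), (2, "b", false))).toDF("id", "name", "unwanted")

    // Keep every column except the one to drop (DataFrame.drop only landed in 1.4.0).
    val trimmed = df.select(df.columns.filter(_ != "unwanted").map(col): _*)
    trimmed.printSchema()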

Re: Can I group elements in RDD into different groups and let each group share some elements?

2015-05-02 Thread Olivier Girardot
Did you look at the cogroup transformation or the cartesian transformation ? Regards, Olivier. Le sam. 2 mai 2015 à 22:01, Franz Chien a écrit : > Hi all, > > Can I group elements in RDD into different groups and let each group share > elements? For example, I have 10,000 elements in RDD from
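A tiny illustration of the cogroup suggestion, assuming a spark-shell session; the keyed data is made up:

    val purchases = sc.parallelize(Seq((1, "book"), (1, "pen"), (2, "lamp")))
    val users     = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))

    // RDD[(K, (Iterable[V], Iterable[W]))]: each key appears once with both groups,
    // so several "groups" can share elements without copying the whole dataset.
    val grouped = purchases.cogroup(users)
    grouped.collect().foreach(println)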

Re: to split an RDD to multiple ones?

2015-05-02 Thread Olivier Girardot
I guess : val srdd_s1 = srdd.filter(_.startsWith("s1_")).sortBy(identity) val srdd_s2 = srdd.filter(_.startsWith("s2_")).sortBy(identity) val srdd_s3 = srdd.filter(_.startsWith("s3_")).sortBy(identity) Regards, Olivier. Le sam. 2 mai 2015 à 22:53, Yifan LI a écrit : > Hi, > > I have an RDD *srdd* containing (unor

Best strategy for Pandas -> Spark

2015-04-30 Thread Olivier Girardot
Hi everyone, Let's assume I have a complex workflow of more than 10 datasources as input - 20 computations (some creating intermediary datasets and some merging everything for the final computation) - some taking on average 1 minute to complete and some taking more than 30 minutes. What would be f

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
You mean after joining ? Sure, my question was more if there was any best practice preferred to joining the other dataframe for filtering. Regards, Olivier. Le mer. 29 avr. 2015 à 13:23, Olivier Girardot a écrit : > Hi everyone, > what is the most efficient way to filter a DataFram

Dataframe filter based on another Dataframe

2015-04-29 Thread Olivier Girardot
Hi everyone, what is the most efficient way to filter a DataFrame on a column from another Dataframe's column. The best idea I had, was to join the two dataframes : > val df1 : Dataframe > val df2: Dataframe > df1.join(df2, df1("id") === df2("id"), "inner") But I end up (obviously) with the "id"
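A hedged sketch of an alternative that avoids the duplicated column: a "leftsemi" join keeps only the rows of df1 whose id exists in df2 and returns df1's columns only (join-type string support assumed for Spark 1.3+); the toy data is illustrative:

    import sqlContext.implicits._

    val df1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "value")
    val df2 = sc.parallelize(Seq(Tuple1(1), Tuple1(3))).toDF("id")

    // Only df1's columns come back, so there is no second "id" to drop afterwards.
    val filtered = df1.join(df2, df1("id") === df2("id"), "leftsemi")
    filtered.show()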

How to distribute Spark computation recipes

2015-04-27 Thread Olivier Girardot
Hi everyone, I know that any RDD is related to its SparkContext and the associated variables (broadcast, accumulators), but I'm looking for a way to serialize/deserialize full RDD computations ? @rxin Spark SQL is, in a way, already doing this but the parsers are private[sql], is there any way to

Re: Spark Streaming updatyeStateByKey throws OutOfMemory Error

2015-04-21 Thread Olivier Girardot
Hi Sourav, Can you post your updateFunc as well please ? Regards, Olivier. Le mar. 21 avr. 2015 à 12:48, Sourav Chandra a écrit : > Hi, > > We are building a spark streaming application which reads from kafka, does > updateStateBykey based on the received message type and finally stores into >

Re: Can a map function return null

2015-04-18 Thread Olivier Girardot
You can return an RDD with null values inside, and afterwards filter on "item != null" In scala (or even in Java 8) you'd rather use Option/Optional, and in Scala they're directly usable from Spark. Example: sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else None).collec
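The truncated example above, written out as a complete sketch (spark-shell, sc available):

    // flatMap over Option simply drops the items you would otherwise have to
    // mark as null and filter away afterwards.
    val evens = sc.parallelize(1 to 1000)
      .flatMap(item => if (item % 2 == 0) Some(item) else None)
    println(evens.count())   // 500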

Re: Build spark failed with maven

2015-02-14 Thread Olivier Girardot
Hi, this was not reproduced for me, what kind of jdk are you using for the zinc server ? Regards, Olivier. 2015-02-11 5:08 GMT+01:00 Yi Tian : > Hi, all > > I got an ERROR when I build spark master branch with maven (commit: > 2d1e916730492f5d61b97da6c483d3223ca44315) > > [INFO] > [INFO] > --

Re: Opening Spark on IntelliJ IDEA

2014-11-29 Thread Olivier Girardot
Hi, are you using spark for a java or scala project and can you post your pom file please ? Regards, Olivier. 2014-11-27 7:07 GMT+01:00 Taeyun Kim : > Hi, > > > > An information about the error. > > On File | Project Structure window, the following error message is > displayed with pink backgro

Re: Cannot access data after a join (error: value _1 is not a member of Product with Serializable)

2014-11-19 Thread Olivier Girardot
can you please post the full source of your code and some sample data to run it on ? 2014-11-19 16:23 GMT+01:00 YaoPau : > I joined two datasets together, and my resulting logs look like this: > > > (975894369,((72364,20141112T170627,web,MEMPHIS,AR,US,Central),(Male,John,Smith))) > > (253142991,(

Re: RDD to Multiple Tables SparkSQL

2014-10-21 Thread Olivier Girardot
If you already know your keys the best way would be to "extract" one RDD per key (it would not bring the content back to the master and you can take advantage of the caching features) and then execute a registerTempTable by Key. But I'm guessing, you don't know the keys in advance, and in this cas
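A rough sketch of that per-key approach, written against the DataFrame API for brevity; the keys, case class and table names are illustrative, and the keys are assumed known up front as discussed:

    import sqlContext.implicits._

    case class Event(key: String, payload: String)
    val events = sc.parallelize(Seq(Event("s1", "a"), Event("s2", "b"), Event("s1", "c")))
    val knownKeys = Seq("s1", "s2")

    knownKeys.foreach { k =>
      // The filter is lazy; cache the per-key DataFrame if it is queried repeatedly.
      val perKey = events.filter(_.key == k).toDF()
      perKey.registerTempTable(s"events_$k")
    }

    sqlContext.sql("SELECT count(*) FROM events_s1").show()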

Re: Convert Iterable to RDD

2014-10-21 Thread Olivier Girardot
I don't think this is provided out of the box, but you can use toSeq on your Iterable and if the Iterable is lazy, it should stay that way for the Seq. And then you can use sc.parallelize(my-iterable.toSeq) so you'll have your RDD. For the Iterable[Iterable[T]] you can flatten it and then create y
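A minimal sketch of both cases described above (spark-shell, sc available; the sample values are made up):

    // Iterable[T] -> RDD[T]: go through toSeq, then parallelize.
    val iterable: Iterable[Int] = 1 to 5
    val rdd = sc.parallelize(iterable.toSeq)

    // Iterable[Iterable[T]] -> RDD[T]: flatten first, then parallelize.
    val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5))
    val flatRdd = sc.parallelize(nested.flatten.toSeq)
    println(flatRdd.count())   // 5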

Re: spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-21 Thread Olivier Girardot
Could you please provide some of your code, and the sample json files you use ? Regards, Olivier. 2014-10-21 5:45 GMT+02:00 tridib : > Hello Experts, > I have two tables build using jsonFile(). I can successfully run join query > on these tables. But once I cacheTable(), all join query fails? >

Re: default parallelism bug?

2014-10-21 Thread Olivier Girardot
Hi, what do you mean by pretty small ? How big is your file ? Regards, Olivier. 2014-10-21 6:01 GMT+02:00 Kevin Jung : > I use Spark 1.1.0 and set these options to spark-defaults.conf > spark.scheduler.mode FAIR > spark.cores.max 48 > spark.default.parallelism 72 > > Thanks, > Kevin > > > > --