Re: CATALYST rule join

2018-02-27 Thread tan shai
operation `Generate explode` appears many times in the physical plan. Do you have any other ideas? Maybe rewriting the code. Thank you
2018-02-25 23:08 GMT+01:00 tan shai:
> Hi,
>
> I need to write a rule to customize the join function using Spark Catalyst optimizer. The objective

CATALYST rule join

2018-02-25 Thread tan shai
Hi, I need to write a rule to customize the join function using the Spark Catalyst optimizer. The objective is to duplicate the second dataset using this process:
- Execute a UDF on the column called x; this UDF returns an array
- Execute an explode function on the new column
Using SQL terms, my objec

Tuning Spark memory

2016-09-23 Thread tan shai
Hi, I am working with Spark 2.0. The job starts by sorting the input data and storing the output on HDFS. I was getting out-of-memory errors; increasing the value of spark.shuffle.memoryFraction from 0.2 to 0.8 solved the problem. But in the documentation I have found th

Total memory of workers

2016-09-06 Thread tan shai
Hello, can anyone explain to me the behavior of Spark if the size of the processed file is greater than the total memory available on the workers? Many thanks.

RangePartitioning

2016-07-08 Thread tan shai
Hi, can anyone explain to me the class RangePartitioning?
https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
case class RangePartitioning(ordering: Seq[SortOrder], numPartiti

[no subject]

2016-07-08 Thread tan shai
Hi, can anyone explain to me the class RangePartitioning?
https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
case class RangePartitioning(ordering: Seq[SortOrder], numPartiti

Re: Extend Dataframe API

2016-07-07 Thread tan shai
That is what I was thinking of doing. Do you have any ideas about this, or any documentation? Many thanks.
2016-07-07 17:07 GMT+02:00 Koert Kuipers:
> i dont see any easy way to extend the plans, beyond creating a custom version of spark.
>
> On Thu, Jul 7, 2016 at 9:31 AM, tan

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
ou could however do dataFrame.rdd, to force it to create a physical plan that results in an actual rdd, and then query the rdd for partition info.
>
> On Thu, Jul 7, 2016 at 4:24 AM, tan shai wrote:
>
>> Using partitioning with dataframes, how can we retrieve informations abou

Extend Dataframe API

2016-07-07 Thread tan shai
Hi, I need to add new operations to the dataframe API. Can anyone explain to me how to extend the query execution plans? Many thanks.

Re: Optimize filter operations with sorted data

2016-07-07 Thread tan shai
lter by time or network_id or both and with other field > Spark only load part of time and network in filter then filter the rest. > > > > > On Jul 7, 2016, at 4:43 PM, Ted Yu wrote: > > > > Does the filter under consideration operate on sorted column(s) ? > > >

Re: Optimize filter operations with sorted data

2016-07-07 Thread tan shai
Yes, it is operating on the sorted column.
2016-07-07 11:43 GMT+02:00 Ted Yu:
> Does the filter under consideration operate on sorted column(s)?
>
> Cheers
>
> > On Jul 7, 2016, at 2:25 AM, tan shai wrote:
> >
> > Hi,
> >
> > I have a sorted dataframe

Optimize filter operations with sorted data

2016-07-07 Thread tan shai
Hi, I have a sorted dataframe and I need to optimize the filter operations. How does Spark perform filter operations on a sorted dataframe? Does it scan all the data? Many thanks.

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
Using partitioning with dataframes, how can we retrieve information about partitions, for example the partition bounds?
Thanks, Shaira
2016-07-07 6:30 GMT+02:00 Koert Kuipers:
> spark does keep some information on the partitions of an RDD, namely the partitioning/partitioner.
>
> GroupSorted is

Dataframe sort

2016-07-05 Thread tan shai
Hi, I need to sort a dataframe and retrieve the bounds of each partition. dataframe.sort() uses range partitioning in the physical plan, and I need to retrieve the partition bounds. Many thanks for your help.