Not that I know of. We did do some work to make it work faster in the case of
lower cardinality: https://issues.apache.org/jira/browse/SPARK-17949
On Wed, Mar 27, 2019 at 4:40 PM, Erik Erlandson < eerla...@redhat.com > wrote:
>
> BTW, if this is known, is there an existing JIRA I should link to?
BTW, if this is known, is there an existing JIRA I should link to?
On Wed, Mar 27, 2019 at 4:36 PM Erik Erlandson wrote:
>
> At a high level, some candidate strategies are:
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can do the right thing.
They are unfortunately all pretty substantial (which is why this problem
exists) ...
On Wed, Mar 27, 2019 at 4:36 PM, Erik Erlandson < eerla...@redhat.com > wrote:
>
> At a high level, some candidate strategies are:
>
> 1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
> trait itself) so that the update method can do the right thing.
At a high level, some candidate strategies are:
1. "fix" the logic in ScalaUDAF (possibly in conjunction with mods to UDAF
trait itself) so that the update method can do the right thing.
2. Expose TypedImperativeAggregate to users for defining their own, since
it already does the right thing.
3. As
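To make option 2 concrete: the useful property of TypedImperativeAggregate is that its buffer stays an arbitrary typed object across update calls and is only serialized when partial results cross task boundaries. A rough Python sketch of that contract follows — the class and method names here are illustrative stand-ins, not Spark's actual API:

```python
import pickle

class MaxAggregator:
    """Illustrative aggregator: typed buffer during update, bytes only at merge."""

    def create_buffer(self):
        return {"max": None}            # a plain object, never a Row

    def update(self, buf, value):
        # Mutate the buffer in place; no serialization happens per row.
        if buf["max"] is None or value > buf["max"]:
            buf["max"] = value
        return buf

    def serialize(self, buf):
        # Only invoked at partition/shuffle boundaries.
        return pickle.dumps(buf)

    def merge(self, buf, serialized_partial):
        other = pickle.loads(serialized_partial)
        if other["max"] is not None:
            return self.update(buf, other["max"])
        return buf

agg = MaxAggregator()
buf = agg.create_buffer()
for v in [3, 1, 4]:
    buf = agg.update(buf, v)
# A partial result from another "task" arrives serialized:
partial = agg.serialize(agg.update(agg.create_buffer(), 9))
buf = agg.merge(buf, partial)
assert buf["max"] == 9
```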
Yes this is known and an issue for performance. Do you have any thoughts on
how to fix this?
On Wed, Mar 27, 2019 at 4:19 PM Erik Erlandson wrote:
> I describe some of the details here:
> https://issues.apache.org/jira/browse/SPARK-27296
>
> The short version of the story is that aggregating data structures (UDTs)
> used by UDAFs are serialized to a Row object, and de-serialized, for every
> row in a data frame.
I describe some of the details here:
https://issues.apache.org/jira/browse/SPARK-27296
The short version of the story is that aggregating data structures (UDTs)
used by UDAFs are serialized to a Row object, and de-serialized, for every
row in a data frame.
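The cost of that per-row round trip can be illustrated outside Spark. This is only a sketch in plain Python, not Spark's actual code path: the first function pickles and unpickles its buffer on every row, the way ScalaUDAF converts its buffer to and from a Row, while the second mutates the buffer in place.

```python
import pickle

def agg_with_per_row_roundtrip(rows):
    # Mimics the ScalaUDAF path: the buffer is serialized and
    # deserialized once per input row.
    buf = pickle.dumps({"count": 0, "sum": 0})
    for x in rows:
        state = pickle.loads(buf)       # deserialize the buffer
        state["count"] += 1
        state["sum"] += x
        buf = pickle.dumps(state)       # serialize it right back
    return pickle.loads(buf)

def agg_in_place(rows):
    # Mimics the TypedImperativeAggregate path: the buffer object
    # is mutated directly, with no per-row serialization.
    state = {"count": 0, "sum": 0}
    for x in rows:
        state["count"] += 1
        state["sum"] += x
    return state

rows = range(1000)
assert agg_with_per_row_roundtrip(rows) == agg_in_place(rows)
```

Both produce identical results; the difference is purely the serialization work done per row, which is what the JIRA above is about.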
Cheers,
Erik
+1 from me - same as last time.
On Wed, Mar 27, 2019 at 1:31 PM DB Tsai wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.1.
>
> The vote is open until March 30 PST and passes if a majority +1 PMC votes are
> cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.1
> [ ] -1 Do not release this package because ...
I'm looking for recommendations on benchmarks for Spark. I'm familiar
with spark-bench[0], but I haven't found much else that suits my
needs. The main property I'm looking for is that the workload of the
benchmark should benefit significantly from non-trivial use of Spark's
caching mechanism since
+1, all the known blockers are resolved. Thanks for driving this!
On Wed, Mar 27, 2019 at 11:31 AM DB Tsai wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.4.1.
>
> The vote is open until March 30 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
Please vote on releasing the following candidate as Apache Spark version 2.4.1.
The vote is open until March 30 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...
To l
Kazuaki Ishizaki,
Yes, ColumnarBatchScan does provide a framework for doing code generation
for the processing of columnar data. I have to admit that I don't have a
deep understanding of the code generation piece, so if I get something
wrong please correct me. From what I had seen only input for
Thanks a lot, it seems to work now and saves a lot of time.
Best Regards
Zhang,Liyun/Kelly Zhang
At 2019-03-26 17:49:03, "Ajith shetty" wrote:
You can try using -pl maven option for this
> mvn clean install -pl :spark-core_2.11
From: Qiu, Gerry
To: zhangliyun; dev@spark.apache.org
Date: 2019-0
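For reference, `-pl` restricts the Maven reactor to the listed modules, and adding `-am` ("also make") builds the modules they depend on as well, which is usually what you want from a clean checkout. A typical invocation might look like:

```shell
# Build only spark-core, plus the modules it depends on, skipping tests
mvn clean install -pl :spark-core_2.11 -am -DskipTests
```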
Hi,
I want to control the placement of the partitions of the Property Graph
across my cluster nodes. As I understand it, to specify the preferred
locations for an RDD's partitions, one needs to create an RDD subclass
that overrides the getPreferredLocations() method. For example
the Paral
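For what it's worth, a minimal sketch of that approach in Scala, assuming a parent RDD and a list of candidate hosts are available — this is illustrative only and has not been exercised against a cluster:

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps a parent RDD and pins each partition to a preferred host,
// round-robin over the supplied host list. The scheduler treats the
// result of getPreferredLocations as a hint, not a guarantee.
class PinnedRDD[T: ClassTag](
    parent: RDD[T],
    hosts: IndexedSeq[String]) extends RDD[T](parent) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    firstParent[T].iterator(split, context)

  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(hosts(split.index % hosts.length))
}
```

Note that preferred locations are only hints: if the host is busy or unavailable, the task may still be scheduled elsewhere, subject to the locality-wait settings.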