I don't think coalesce itself (by "repartitioning" I assume you mean coalesce)
and deserialising take that much time. To add a little more context: the
computation of the DataFrame is CPU-intensive rather than data/IO-intensive. I
purposely keep coalesce after df.count as I want to keep the
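If cache reuse turns out to be the problem, one way to keep full parallelism
for the compute while still shrinking the output is to break the lineage by
writing out and reading back. A minimal sketch, assuming a hypothetical
expensive_computation() and illustrative paths and partition counts:

    df = expensive_computation()                   # hypothetical CPU-heavy step
    df.write.parquet("/tmp/intermediate")          # materialise with full parallelism
    df2 = spark.read.parquet("/tmp/intermediate")  # assumes an existing SparkSession `spark`
    df2.coalesce(300).write.parquet("/tmp/final")  # shrink only for the final write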
Hi,
As you can find in the description on the website[1] of Apache Kyuubi
(incubating):
"Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface
for end-users to manipulate large-scale data with pre-programmed and
extensible Spark SQL engines."
[1]: https://kyuubi.apache.org/
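Since Kyuubi speaks the HiveServer2 Thrift protocol, any compatible client can
connect. A minimal PyHive sketch, assuming a placeholder host name and Kyuubi's
default frontend port 10009:

    from pyhive import hive  # any HiveServer2-compatible client works

    conn = hive.connect(host="kyuubi-host", port=10009, username="user")
    cursor = conn.cursor()
    cursor.execute("SELECT 1")   # runs on the Spark SQL engine behind the gateway
    print(cursor.fetchall())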
Best
What’s the difference between Spark and Kyuubi?
Thanks
On Mon, Jan 31, 2022 at 2:45 PM Vino Yang wrote:
> Hi all,
>
> The Apache Kyuubi (Incubating) community is pleased to announce that
> Apache Kyuubi (Incubating) 1.4.1-incubating has been released!
>
> Apache Kyuubi (Incubating) is a distrib
Hi all,
The Apache Kyuubi (Incubating) community is pleased to announce that
Apache Kyuubi (Incubating) 1.4.1-incubating has been released!
Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for
large-scale data processing and analytics, built on top of Apache Spark
and designed
Hi Stephen,
Thank you for your answer. Yes, I changed the scope to "provided" but got
the same error :-( FYI, I am getting this error while running tests.
Regards,
Aurelien
On Thu, Jan 27, 2022 at 11:57 PM, Stephen Coy wrote:
> Hi Aurélien,
>
> Your Jackson versions look fine.
>
> What happen
Any particular code sample you can suggest reviewing, based on your tips?
> On Jan 30, 2022, at 06:16, Sebastian Piu wrote:
>
> It's because all data needs to be pickled back and forth between Java and a
This one you can ignore. It's from the JVM so you might be able to disable
it by configuring the right JVM logger as well, but it also tells you right
in the message how to turn it off!
But this is saying that some reflective operations are discouraged in Java
9+. They still work and Spark needs t
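If you do want to silence it at the source, the usual route is a JVM flag
passed through Spark's conf. A sketch, assuming Java 9+ and that sun.nio.ch is
the package being complained about; the exact --add-opens list depends on your
Java version and workload:

    # spark-defaults.conf (or equivalent --conf flags on spark-submit)
    spark.driver.extraJavaOptions    --add-opens=java.base/sun.nio.ch=ALL-UNNAMED
    spark.executor.extraJavaOptions  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED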
The signature in your mail already shows the info:
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
On Sun, Jan 30, 2022 at 8:50 PM Lucas Schroeder Rossi
wrote:
> unsubscribe
>
> -
> To unsubscribe e-mail: user-unsubscr.
unsubscribe
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
It's because all data needs to be pickled back and forth between Java and a
spun-up Python worker, so there is additional overhead compared to staying
fully in Scala.
Your Python code might make this worse too, for example if it is not yielding
from operations
You can look at using UDFs and Arrow or trying t
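As an illustration of the Arrow route, a vectorised pandas UDF receives whole
columns as pandas Series via Arrow instead of pickling individual rows. A
minimal sketch, assuming Spark 3.x and an existing SparkSession named spark:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        # Columnar batches arrive via Arrow; no per-row pickling.
        return v * 2.0

    df = spark.range(1_000_000).withColumn("x", times_two("id"))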
Hello list,
I did a comparison of PySpark RDD, Scala RDD, PySpark DataFrame and a pure
Scala program. The result shows the PySpark RDD is far too slow.
For the operations and dataset please see:
https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/
The result table is b
Hi,
Can you please try to see if you can increase the number of cores per task,
and therefore give each task more memory per executor?
I do not understand what the XML is, what data is in it, or what problem you
are trying to solve by writing UDFs to parse XML. So
maybe we are not
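On the cores-per-task point, a sketch of the relevant settings; the values are
illustrative, and raising spark.task.cpus halves the number of concurrent tasks
per executor, which leaves each task a larger share of executor memory:

    # spark-defaults.conf (or --conf flags); values are illustrative
    spark.executor.cores   4
    spark.executor.memory  8g
    spark.task.cpus        2   # each task claims 2 cores -> fewer concurrent
                               # tasks, so more memory available per task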
Hi,
without getting into suppositions, the best option is to look into the
Spark UI SQL section.
It is the most wonderful tool to explain what is happening, and why. In
Spark 3.x they have made the UI even better, with different sets of
granularity and detail.
On another note, you might want to
Hi,
I think it will be useful to understand the problem before solving the
problem.
Can I please ask what this table is? Is it a fact (event store) kind of
table, or a dimension (master data) kind of table? And what are the
downstream consumers of this table?
Besides that, what is the unique
Hi,
I have often found that logging the warnings is extremely useful; they
are just logs, and they provide a lot of insight during upgrades, external
package loading, deprecation, debugging, etc.
Do you have any particular reason to disable the warnings in a submitted
job?
I used to disable warni
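If you do need to dial them down, the usual levers are the Spark log level and
Python's own warnings filter, rather than hiding everything. A minimal sketch,
assuming an existing SparkContext named sc:

    sc.setLogLevel("ERROR")   # JVM-side: hide WARN and below

    import warnings           # Python-side warnings are separate
    warnings.filterwarnings("ignore", category=DeprecationWarning)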
coalesce returns a new dataset.
That will cause the recomputation.
Thanks
Deepak
On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote:
> I have some PySpark code like below. Basically, I persist a DataFrame
> (which is time-consuming to compute) to disk, call the method
> DataFrame.count to trigger
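If that is the cause, the fix is to persist the dataset you actually act on
downstream. A minimal sketch with illustrative names, assuming a hypothetical
expensive_computation():

    df = expensive_computation()            # hypothetical CPU-heavy step
    small = df.coalesce(300).persist()      # persist the coalesced dataset itself
    small.count()                           # materialise it once
    small.write.parquet("/tmp/out")         # reuses the cache, no recompute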
It's probably the repartitioning and deserialising of the df that you are
seeing take time. Try doing this:
1. Add another count after your current one and compare times (see the timing
sketch below)
2. Move coalesce before persist
You should see
On Sun, 30 Jan 2022, 08:37 Benjamin Du, wrote:
> I have some PySpark code like
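A quick way to run check 1, timing two consecutive counts; if the persist
worked, the second should be dramatically faster because it reads from the
cached data:

    import time

    for label in ("first count", "second count"):
        t0 = time.time()
        df.count()                          # assumes the persisted DataFrame `df`
        print(label, time.time() - t0)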
Hi Amit,
before answering your question, I am just trying to understand it.
I am not exactly clear on how the Akka application, Kafka, and the Spark
Streaming application sit together, or what exactly you are trying to
achieve.
Can you please elaborate?
Regards,
Gourav
On Fri, Jan 28, 2022 at 10:
I have some PySpark code like below. Basically, I persist a DataFrame (which is
time-consuming to compute) to disk, call the method DataFrame.count to trigger
the caching/persist immediately, and then I coalesce the DataFrame to reduce
the number of partitions (the original DataFrame has 30,000
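The code in question was along these lines; a sketch reconstructed from the
description above, with hypothetical names and an illustrative target
partition count:

    from pyspark import StorageLevel

    df = expensive_computation()              # hypothetical; ~30,000 partitions
    df = df.persist(StorageLevel.DISK_ONLY)   # persist to disk
    df.count()                                # trigger the persist immediately
    df = df.coalesce(300)                     # then reduce the partition count
    df.write.parquet("/path/to/output")       # illustrative output path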