Sachit Murarka
Reposting once.
Kind Regards,
Sachit Murarka
On Tue, Sep 20, 2022 at 6:56 PM Sachit Murarka
wrote:
> Hi All,
>
> I am getting the below error. I read the documentation and understood that
> we need to set 2 properties:
> spark.conf.set("spark.sql.parquet.int96RebaseMo
e datetime values w.r.t. the calendar difference during writing,
to get maximum
interoperability. Or set spark.sql.parquet.int96RebaseModeInWrite to
'CORRECTED' to write the datetime values as it is,
if you are 100% sure that the written files will only be read by Spark 3.0+
or other
syst
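For reference, a minimal PySpark sketch of setting both rebase properties,
assuming they are the INT96 rebase modes for read and write (conf names as in
Spark 3.1+; the paths are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("int96-rebase-example").getOrCreate()

# 'CORRECTED' writes/reads the timestamps as-is; 'LEGACY' rebases them for
# interoperability with Spark 2.x / Hive readers.
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")

df = spark.read.parquet("/path/to/old_timestamps")   # hypothetical input path
df.write.mode("overwrite").parquet("/path/to/rewritten")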
On Tue, Sep 13, 2022, 21:23 Sachit Murarka wrote:
> Hi Vibhor,
>
> Thanks for your response!
>
> There are some properties which can be set after creation of the Spark
> session without changing this flag
> "spark.sql.legacy.setCommandRejectsSparkCoreConfs", like shuf
rs.scala:2322)
at
org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:157)
at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:41)
Kind Regards,
Sachit Murarka
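For context, a minimal sketch of the runtime vs. static conf distinction behind
the requireNonStaticConf error above (conf values are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("conf-example")
    # static confs must be set on the builder, before the session exists
    .config("spark.sql.warehouse.dir", "/tmp/warehouse")
    .getOrCreate()
)

# runtime SQL confs (e.g. shuffle partitions) can be changed afterwards
spark.conf.set("spark.sql.shuffle.partitions", "64")

# doing the same for a static conf raises the error from the stack trace above:
# spark.conf.set("spark.sql.warehouse.dir", "/tmp/other")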
Hello ,
Thanks for replying. I installed the Scala plugin in IntelliJ first, but it
still gives the same error:
Cannot find project Scala library 2.12.12 for module SparkSimpleApp
Thanks
Rajat
On Sun, Feb 27, 2022, 00:52 Bitfox wrote:
> You need to install scala first, the current version for
to be shared among various executors, I thought of using an Accumulator,
but the accumulator supports only integral values.
Can someone please suggest how I can collect, into a list, all the errors
coming from all records of an RDD?
Thanks,
Sachit Murarka
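A minimal PySpark sketch of one way to do this (class and variable names are
illustrative): a custom AccumulatorParam that accumulates lists, so each task
can append its error messages. Updates made inside transformations can be
applied more than once if a task is retried, so treat the list as best-effort
diagnostics.

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class ListAccumulator(AccumulatorParam):
    def zero(self, initial_value):
        return []
    def addInPlace(self, v1, v2):
        v1.extend(v2)
        return v1

sc = SparkContext.getOrCreate()
errors = sc.accumulator([], ListAccumulator())

def safe_parse(record):
    try:
        return int(record)                 # illustrative per-record work
    except Exception as e:
        errors.add([f"{record}: {e}"])     # collect the error as a one-item list
        return None

rdd = sc.parallelize(["1", "2", "oops", "4"])
rdd.map(safe_parse).count()                # run an action so the tasks execute
print(errors.value)                        # driver-side list of all errors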
Hi Chetan,
You can subtract the data frames or use the except operation.
The first DF contains all rows.
The second DF contains only unique rows (after removing duplicates).
Subtract the second DF from the first.
Hope this helps
Thanks
Sachit
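A minimal sketch of the idea above (data is illustrative, assuming the goal is
to isolate the rows that occur more than once):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("except-example").getOrCreate()

df_all = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
df_unique = df_all.dropDuplicates()

# exceptAll (Spark 2.4+) respects multiplicity, so the extra duplicate copies
# are what remain after removing one occurrence of every distinct row
duplicates = df_all.exceptAll(df_unique)
duplicates.show()                          # -> the second (1, "a") row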
On Tue, Jun 22, 2021, 22:23 Chetan Khatri
wrote:
> Hi Spark Users,
>
> I want
Hello Spark Users,
We are receiving too many small files, about 3 million. Reading them using
spark.read alone is taking a long time and the job is not proceeding further.
Is there any way to speed this up?
Regards
Sachit Murarka
Hi All,
I am using Spark with Kubernetes. Can anyone please tell me how I can
handle restarting failed Spark jobs?
I have used the following property but it is not working:
restartPolicy:
  type: OnFailure
Kind Regards,
Sachit Murarka
(Thread.java:834)
Kind Regards,
Sachit Murarka
topic":{"1":1499,"0":1410}}}
Kind Regards,
Sachit Murarka
On Fri, Mar 12, 2021 at 5:44 PM Gabor Somogyi
wrote:
> Please see that driver side for example resolved in 3.1.0...
>
> G
>
>
> On Fri, Mar 12, 2021 at 1:03 PM Sachit Murarka
> wrote:
>
>> Hi Gabor,
Hi Gabor,
Thanks a lot for the response. I am using Spark 3.0.1 and this is spark
structured streaming.
Kind Regards,
Sachit Murarka
On Fri, Mar 12, 2021 at 5:30 PM Gabor Somogyi
wrote:
> Since you've not provided any version I guess you're using 2.x and you're
> hitt
1 could be determined
Current Committed Offsets: {KafkaV2[Subscribe[my-topic]]:
{"my-topic":{"1":1498,"0":1410}}}
Current Available Offsets: {KafkaV2[Subscribe[my-topic]]:
{"my-topic":{"1":1499,"0":1410}}}
Kind Regards,
Sachit Murarka
d the data from
> partitions, you can choose to repartition the batch so it is processed by
> multiple tasks.
>
> On Mon, Mar 8, 2021 at 10:57 PM Sachit Murarka
> wrote:
>
>> Hi All,
>>
>> I am using Spark 3.0.1 Structuring streaming with Pyspark.
>>
>
load() .selectExpr("CAST(value AS STRING)")
query = df.writeStream.foreach(process_events).option("checkpointLocation",
"/opt/checkpoint").trigger(processingTime="30 seconds").start()
Kind Regards,
Sachit Murarka
Thanks Sean.
Kind Regards,
Sachit Murarka
On Mon, Mar 8, 2021 at 6:23 PM Sean Owen wrote:
> It's there in the error: No space left on device
> You ran out of disk space (local disk) on one of your machines.
>
> On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka
> wrote:
>
ge before
> asking people to go through it. Also I am pretty sure that the error is
> mentioned in the first line itself.
>
> Any ideas regarding the SPARK version, and environment that you are using?
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Mar 8, 2021 at 8
more\n\n"}
Kind Regards,
Sachit Murarka
Hi Mich,
Thanks for the reply. Will check this out.
Kind Regards,
Sachit Murarka
On Fri, Feb 26, 2021 at 2:14 AM Mich Talebzadeh
wrote:
> Hi Sachit,
>
> I managed to make mine work using the *foreachBatch function *in
> writeStream.
>
> "foreach" performs custom
reaming.sources.ForeachWriterTable$$anon$1$$anon$2@30f2abbb
+- Project [cast(value#8 as string) AS value#21]
+- StreamingDataSourceV2Relation [key#7, value#8, topic#9, partition#10,
offset#11L, timestamp#12, timestampType#13],
org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaScan@433a9c3b,
Kafk
ermination
return self._jsq.awaitTermination()
File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
line 1304, in __call__
return_value = get_return_value(
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134,
in deco
raise_from(co
d above code will run multiple
times for each single message. If I change it to foreachBatch, will it
optimize it?
Kind Regards,
Sachit Murarka
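A minimal sketch of the foreachBatch variant (broker, topic and sink path are
illustrative): the callback runs once per micro-batch DataFrame instead of once
per row, so per-batch setup such as opening connections happens once per
trigger.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachbatch-example").getOrCreate()

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my-topic")
    .load()
    .selectExpr("CAST(value AS STRING)")
)

def process_batch(batch_df, batch_id):
    # whatever per-message work process_events did, done for the whole batch
    batch_df.write.mode("append").parquet("/opt/output")   # hypothetical sink

query = (
    df.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/opt/checkpoint")
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()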
:
> Hi Sachit,
>
> The fix verison on that JIRA says 3.0.2, so this fix is not yet released.
> Soon, there will be a 3.1.1 release, in the meantime you can try out the
> 3.1.1-rc which also has the fix and let us know your findings.
>
> Thanks,
>
>
> On Mon, Feb 1, 2
Application wise it won't show as such.
You can try to correlate it with the explain plan output using some filters or
attributes.
Or else, if you do not have too many queries in history, just take the queries,
get the plans of those queries, and match them with what is shown in the UI.
I know that's a tedious task. But I
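As a rough illustration of that suggestion (run in a PySpark session; the table
and query are made up), print the plan of a candidate query and compare it with
the plan shown in the SQL tab of the UI / history server:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-matching").getOrCreate()
spark.createDataFrame([("eng", 1), ("ops", 2)], ["dept", "id"]) \
    .createOrReplaceTempView("employees")

candidate = spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")
candidate.explain(True)    # parsed, analyzed, optimized and physical plans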
Hi Arpan,
In spark-shell, when you type
:history
is it still not showing anything?
Thanks
Sachit
On Mon, 1 Feb 2021, 21:13 Arpan Bhandari, wrote:
> Hey Sachit,
>
> It shows the query plan, which is difficult to diagnose and map back to the
> actual query.
>
>
> Thanks,
> Arpan Bhandari
>
>
>
> --
>
Following is the related JIRA, can someone please check:
https://issues.apache.org/jira/browse/SPARK-24266
I am using 3.0.1, and it says fixed in 3.0.0 and 3.1.0. Could you please
suggest what can be done to avoid this?
Kind Regards,
Sachit Murarka
On Sun, Jan 31, 2021 at 6:38 PM Sachit Murarka
Regards,
Sachit Murarka
ur query.
Hope this helps!
Kind Regards,
Sachit Murarka
On Fri, Jan 29, 2021 at 9:33 PM Arpan Bhandari wrote:
> Hi Sachit,
>
> Yes, it was executed using spark-shell, and history is already enabled. I have
> already checked the SQL tab but it is not showing the query. My Spark version
> is 2.4.5
Hi Arpan,
Was it executed using spark-shell?
If yes, type :history
Do you have the history server enabled?
If yes, go to the History UI and open the SQL tab.
Thanks
Sachit
On Fri, 29 Jan 2021, 19:19 Arpan Bhandari, wrote:
> Hi ,
>
> Is there a way to track back spark sql after it has be
ent
Could you please suggest why deploy-mode client is mentioned in entrypoint.sh?
I am running spark-submit using deploy mode cluster, but inside
entrypoint.sh it is mentioned like that.
Kind Regards,
Sachit Murarka
Hi Vikas,
1. Are you running in local mode? The master is set to local[*].
2. Please mask the IP or any confidential info while sharing logs.
Thanks
Sachit
On Wed, 20 Jan 2021, 17:35 Vikas Garg, wrote:
> Hi,
>
> I am facing an issue with the Spark executor. I have been struggling with this
> issue for many days and unab
Sure Sean. Thanks for confirmation.
On Fri, 15 Jan 2021, 10:57 Sean Owen, wrote:
> You can ignore that. Spark 3.x works with Java 11 but it will generate
> some warnings that are safe to disregard.
>
> On Thu, Jan 14, 2021 at 11:26 PM Sachit Murarka
> wrote:
>
>> Hi Al
,int)
WARNING: Please consider reporting this to the maintainers of
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
Kind Regards,
Sachit
Hi ,
Yes, I know that by enabling the shuffle tracking property we can use DRA.
But it is marked as experimental. Is it advisable to use?
Also, regarding HPA: we do not have HPA separately as such for Spark, right?
Kind Regards,
Sachit Murarka
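For reference, a minimal sketch of enabling dynamic resource allocation with
shuffle tracking (the values are illustrative); shuffle tracking stands in for
an external shuffle service, which is what the "experimental" remark refers to:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dra-on-k8s")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)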
On Mon, Jan 11, 2021 at 2:17 AM Sandish Kumar HN
how can I proceed with achieving pod scaling in Spark?
Please note : I am using Kubernetes with Spark operator.
Kind Regards,
Sachit Murarka
t will be transitioning from experimental to GA in
> this release.
>
> See: https://issues.apache.org/jira/browse/SPARK-33005
>
> Thanks,
>
> On Tue, Jan 5, 2021 at 12:41 AM Sachit Murarka
> wrote:
>
>> Hi Users,
>>
>> Could you please tell which Spark version have
Hi Users,
Could you please tell which Spark version you have used in production for
Kubernetes?
Which version is recommended for production, given that both the Streaming
and core APIs have to be used from PySpark?
Thanks !
Kind Regards,
Sachit Murarka
spark-test --conf
spark.executor.instances=5 --conf
spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa --conf
spark.kubernetes.container.image=sparkpy local:///opt/spark/da/main.py
Kind Regards,
Sachit Murarka
On Mon, Jan 4, 2021 at 5:46 PM Prashant Sharma wrote:
> Hi Sachit
parameter mentioned in the JIRA
too (spark.kubernetes.driverEnv.HTTP2_DISABLE=true), but that also did not work.
Can anyone suggest what can be done?
Kind Regards,
Sachit Murarka
ct.
Please carefully study the documentation linked above for further help.
Original error was: No module named 'numpy.core._multiarray_umath'
Kind Regards,
Sachit Murarka
On Thu, Dec 17, 2020 at 9:24 PM Patrick McCarthy
wrote:
> I'm not very familiar with the environments o
Regards,
Sachit Murarka
Hi All,
I have created a wheel file and I am using the following command to run the
Spark job:
spark-submit --py-files application.whl main_flow.py
My application is unable to reference the modules. Do I need to do a pip
install of the wheel first?
Kind Regards,
Sachit Murarka
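As a rough sketch (the module name is hypothetical, and this assumes a
pure-Python wheel): a wheel passed via --py-files is shipped and placed on the
PYTHONPATH, so a pip install is not needed as long as the wheel itself contains
everything main_flow.py imports; wheels with native dependencies generally do
need to be installed into the executor environment. The same effect,
programmatically:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wheel-example").getOrCreate()
spark.sparkContext.addPyFile("application.whl")   # ships the wheel, adds it to PYTHONPATH

import my_module       # hypothetical module packaged inside application.whl
print(my_module.__name__)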
Hi All,
I am using standalone Spark.
I am using dynamic allocation. Despite setting max executors, min executors,
and initial executors, my streaming job is taking all the executors available
in the cluster. Could anyone please suggest what could be wrong here?
Please note the source is Kafka.
I fe
',1)
as anyid").show()
and as I mentioned when I am using 2 backslashes it is giving an exception
as follows:
: java.util.regex.PatternSyntaxException: Unknown inline modifier near
index 21
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
Kind Regards,
S
).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)
Can you please help here?
Kind Regards,
Sachit Murarka
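For what it's worth, a simplified sketch of the two-backslash escaping being
discussed (the pattern and data here are illustrative and deliberately omit the
conditional group from the original pattern): inside a Spark SQL string literal
each literal backslash is written as two backslashes.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("regexp-example").getOrCreate()

df = spark.createDataFrame([("[OrderID: 1] [UniqueID: abc123]",)], ["line"])
df.selectExpr(
    r"regexp_extract(line, '\\[UniqueID:\\s([a-z0-9A-Z]*)\\]', 1) as anyid"
).show()    # -> abc123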
Hi Users,
I have to write unit test cases for PySpark.
I think pytest-spark and "spark-testing-base" are good test libraries.
Can anyone please provide a full reference for writing test cases in
Python using these?
Kind Regards,
Sachit Murarka
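Not a full reference, but a minimal pytest sketch (file and function names are
illustrative); pytest-spark ships a ready-made spark_session fixture, while a
plain fixture like the one below needs no extra plugin:

import pytest
from pyspark.sql import SparkSession, functions as F

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    )
    yield session
    session.stop()

def test_uppercase_column(spark):
    df = spark.createDataFrame([("a",), ("b",)], ["letter"])
    result = [r.letter for r in df.select(F.upper("letter").alias("letter")).collect()]
    assert result == ["A", "B"]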
Thanks
Sachit
Kind Regards,
Sachit Murarka
ough 25
> processes on a single node seem too high)
>
>
>
> *From: *Sachit Murarka
> *Date: *Tuesday, October 13, 2020 at 8:15 AM
> *To: *spark users
> *Subject: *RE: [EXTERNAL] Multiple applications being spawned
>
>
>
> *CAUTION*: This email originated fr
(PythonRunner.scala:346)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at
org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:195)
Kind Regards,
Sachit Murarka
On Tue, Oct 13, 2020 at 4:02 PM Sachit Murarka
wrote:
> Hi Users,
>
converting it back to a dataframe and then applying 2 actions (count & write).
Please note: this was working fine until last week; it started giving this
issue yesterday.
Could you please tell what could be the reason for this behavior?
Kind Regards,
Sachit Murarka
data set since it has 2 cols only.
Thanks
Sachit
On Wed, 7 Oct 2020, 01:04 Eve Liao, wrote:
> Try to avoid broadcast. Thought this:
> https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6
> could be helpful.
>
> On Tue, Oct 6, 2020 at 12:18 PM
hundreds of thousands of rows is a broadcast candidate.
> Your broadcast variable is probably too large.
>
> On Tue, Oct 6, 2020 at 11:37 AM Sachit Murarka
> wrote:
>
>> Hello Users,
>>
>> I am facing an issue in a Spark job where I am doing row_number() without
it.
Kind Regards,
Sachit Murarka
please suggest something? I have sufficient memory in executors
and the driver as well.
Kind Regards,
Sachit Murarka
hit
> PGP encryption is something that is not built into Spark.
> I would suggest writing a shell script that does the PGP encryption and
> calling it from the Spark Scala program, so that it runs from the driver.
>
> Thanks
> Deepak
>
> On Mon, Aug 26, 2019 at 8:10 PM Sachit Murarka
> wrote
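A minimal sketch of that suggestion, in Python rather than Scala (the paths,
recipient and availability of the hdfs and gpg CLIs on the driver are
assumptions): pull the file from HDFS, encrypt it with gpg via a shell call,
and push the result back.

import subprocess

def pgp_encrypt_hdfs_file(hdfs_path, local_path, recipient):
    # copy the file from HDFS to the driver's local disk
    subprocess.run(["hdfs", "dfs", "-get", hdfs_path, local_path], check=True)
    # encrypt it for the given recipient key
    subprocess.run(["gpg", "--yes", "--recipient", recipient,
                    "--output", local_path + ".gpg", "--encrypt", local_path],
                   check=True)
    # push the encrypted file back to HDFS
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path + ".gpg",
                    hdfs_path + ".gpg"], check=True)

pgp_encrypt_hdfs_file("/data/in/file.csv", "/tmp/file.csv", "ops@example.com")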
Hi All,
I want to encrypt my files available at an HDFS location using PGP encryption.
How can I do it in Spark? I saw Apache Camel, but it seems Camel is used
when the source files are in a local location rather than in HDFS.
Kind Regards,
Sachit Murarka
.
>>>
>>> Target is Oracle Database.
>>>
>>> My Goal is to maintain latest record for a key in Oracle. Could you
>>> please suggest how this can be implemented efficiently?
>>>
>>> Kind Regards,
>>> Sachit Murarka
>>>
>>
Hi All,
I will receive records continuously in text file form (streaming). Each record
will also have a timestamp field.
The target is an Oracle database.
My goal is to maintain the latest record for a key in Oracle. Could you please
suggest how this can be implemented efficiently?
Kind Regards,
Sachit Murarka
Hi All,
I am using Spark 2.2.
I have enabled Spark dynamic allocation with executor cores 4, driver cores 4,
executor memory 12 GB, and driver memory 10 GB.
In the Spark UI, I see only 1 task launched per executor.
Could anyone please help with this?
Kind Regards,
Sachit Murarka
Hi All,
I have simply added exception handling to my code in Scala. I am getting a
NoClassDefFoundError. Any leads would be appreciated.
Thanks
Kind Regards,
Sachit Murarka