storage you had with DStreams.
HTH
Mich Talebzadeh,
<https://www.linkedin.com/pulse/processing-change-data-capture-spark-structured-talebzadeh-ph-d-/>
Mich Talebzadeh,
There is not a direct equivalent of DStream HasOffsetRanges in Spark
Structured Streaming. However, Structured Streaming provides mechanisms to
achieve similar functionality.
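For illustration, a minimal PySpark sketch (not from the original message) of one such mechanism: the per-source offsets that Structured Streaming records for the last completed micro-batch are available through query.lastProgress (or via a StreamingQueryListener). The topic, servers and checkpoint path below are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offset_tracking_demo").getOrCreate()

# Kafka source; servers and topic are illustrative assumptions
df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input_topic")
          .load())

query = (df.writeStream
           .format("console")
           .option("checkpointLocation", "/tmp/offset_demo_chk")
           .start())

# lastProgress is a plain dict in PySpark; each entry of sources[] carries
# startOffset/endOffset, roughly what HasOffsetRanges exposed per RDD in DStreams
progress = query.lastProgress
if progress:
    for source in progress["sources"]:
        print(source["startOffset"], source["endOffset"])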
HTH
Mich Talebzadeh,
": "string",
"null_count": 21921,
"null_percentage": 23.48,
"distinct_count": 38726,
"distinct_percentage": 41.49
}
}
Mich Talebzadeh,
The details are in the attached RENAME file
Let me know what you think! Feedback is always welcome.
HTH
Mich Talebzadeh,
|  b|
|  d|
+---+
df_1:
+---------+
|     data|
+---------+
|[a, b, c]|
|       []|
+---------+
Result:
+----+
|data|
+----+
|   a|
|   b|
+----+
HTH
Mich Talebzadeh,
", "5 minutes")). \
    avg('temperature')
- Write to Sink: Write the filtered records (dropped records) to a
separate Kafka topic.
- Consume and Store: Consume the dropped-records topic with another
streaming job and store them in a Postgres table or S3.
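As a rough, hedged sketch of the "write dropped records to a separate Kafka topic" step (not the original code): Spark does not hand back watermark-dropped rows directly, so the late rows are split off manually against a lateness threshold here; the DataFrame, event_time column, topic, servers, paths and the 5-minute threshold are all assumptions.

from pyspark.sql import functions as F

# rows considered too late for the main aggregation (threshold is an assumption)
late_df = df.filter(
    F.col("event_time") < F.current_timestamp() - F.expr("INTERVAL 5 MINUTES"))

# the Kafka sink needs a value column, hence the to_json(struct(...)) step
(late_df.select(F.to_json(F.struct("*")).alias("value"))
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")       # assumption
        .option("topic", "dropped_records")                        # assumption
        .option("checkpointLocation", "/tmp/dropped_records_chk")  # assumption
        .start())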
think about another way and revert
HTH
Mich Talebzadeh,
Have a look at the ticket and add your comments.
Thanks
Mich Talebzadeh,
My recommendation: using materialized views (MVs) created in Hive with
Spark Structured Streaming and Change Data Capture (CDC) is a good
combination for efficiently streaming view data updates in your scenario.
HTH
Mich Talebzadeh,
Using materialized views with Spark Structured Streaming and Change Data Capture
(CDC) is a potential solution for efficiently streaming view data updates
in this scenario.
Mich Talebzadeh,
I would be interested to know if anyone has encountered a similar issue, or if there
are any insights into why this discrepancy exists between Spark SQL and
Hive.
Thanks
Mich Talebzadeh,
If the application restarts, the
monotonically_increasing_id() sequence might restart from the beginning.
This could again cause duplicate IDs if other Spark applications are
running concurrently or if data is processed across multiple runs of the
same application.
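A small, hedged alternative (the DataFrame and column names are illustrative): if IDs must stay unique across restarts and concurrent applications, a generated UUID side-steps the restart behaviour described above.

from pyspark.sql import functions as F

# uuid() is a built-in SQL function (Spark 2.3+); each row gets a globally unique id
df_with_id = df.withColumn("row_id", F.expr("uuid()"))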
HTH
Mich Talebzadeh,
spark.sql.shuffle.partitions=auto
This is because open-source Apache Spark does not support the auto value. This
configuration option is specific to Databricks and their managed Spark offering. It allows
Databricks to automatically determine an optimal number of shuffle
partitions for your workload.
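For open-source Spark, a rough analogue is Adaptive Query Execution, which coalesces shuffle partitions at runtime; the sketch below is illustrative and the explicit number shown is just the default.

# let AQE pick/merge shuffle partitions at runtime (Spark 3.x)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# or fix the number yourself (the default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "200")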
HTH
Mich Talebzadeh
ces are limited (memory, CPU).
b) Data Skew: An uneven distribution of values in certain columns could
lead to imbalanced processing across machines. Check the Spark UI (port 4040),
in particular the Stages and Executors tabs.
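If skew does turn out to be the culprit, one common mitigation is key salting. Below is a minimal sketch; the key name, the salt range of 10 and the DataFrame names are assumptions, and on Spark 3.x enabling spark.sql.adaptive.skewJoin.enabled is another option.

from pyspark.sql import functions as F

# add a random salt to the skewed side, replicate the small side per salt value
salted_big = big_df.withColumn("salt", (F.rand() * 10).cast("int"))
salted_small = small_df.crossJoin(
    spark.range(10).withColumnRenamed("id", "salt"))

# join on the original key plus the salt, then drop the salt afterwards
joined = salted_big.join(salted_small, ["key", "salt"]).drop("salt")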
HTH
Mich Talebzadeh,
creation? Are joins matching columns correctly?
4) Specific Edge Issues: Can you share examples of vertex IDs with
incorrect connections? Is this related to ID generation or edge creation
logic?
HTH
Mich Talebzadeh,
Interesting.
My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within
foreachRDD creates an infinite loop within each Spark executor. This might
not be the most efficient approach, especially since offsets are committed
asynchronously.
HTH
Mich Talebzadeh,
HTH
Mich Talebzadeh,
Now I recently saw a note (if I recall correctly) that Spark should be
using camelCase in new Spark-related documents. What are the accepted views,
or does it matter?
Thanks
Mich Talebzadeh,
    tes')
    ORDER BY window.start
"""
# Write the aggregated results to the Kafka sink
stream = session.sql(query) \
    .writeStream \
    .format("kafka") \
    .option("checkpointLocation", "checkpoint") \
    .option("kafka.bootstrap.servers", "localhost:9092
, numBytes) => host
}.toArray
}
}
HTH
Mich Talebzadeh,
your work. Say with Oracle (as an example), utilise tools
like OEM, STATSPACK, SQL*Plus scripts etc., or third-party monitoring tools
to collect detailed database health metrics directly from the Oracle
database server.
HTH
Mich Talebzadeh,
Thanks
Mich Talebzadeh,
within the trigger
interval, preventing backlogs and potential OOM issues.
From the Spark UI, look at the Streaming tab. There are various statistics
there. In general your Processing Time has to be less than your batch
interval. The Scheduling Delay and Total Delay are additional indicators of
whether the pipeline is keeping up.
Thanks Cheng for the heads up. I will have a look.
Cheers
Mich Talebzadeh,
a Kubernetes cluster. They can include
these configurations in the Spark application code or pass them as
command-line arguments or environment variables during application
submission.
HTH
Mich Talebzadeh,
better performance and scalability for handling larger datasets
efficiently.
Mich Talebzadeh,
with these file systems come into
it. I will be interested in hearing more about any progress on this.
Thanks
Mich Talebzadeh,
window-generated 'start'
# Rest of the code remains the same
streaming_df.createOrReplaceTempView("streaming_df")
spark.sql("""
    SELECT
        window.start, window.end, provinceId, totalPayAmount
    FROM streaming_df
    ORDER BY window.start
""") \
    .writeStream \
    .format
GROUP BY provinceId, window('createTime', '1 hour', '30 minutes')
ORDER BY window.start
""")
.writeStream
.format("kafka")
.option("checkpointLocation", "checkpoint")
.option("kafka.bootstr
Sorry, from this link:
Leveraging Generative AI with Apache Spark: Transforming Data Engineering |
LinkedIn
<https://www.linkedin.com/pulse/leveraging-generative-ai-apache-spark-transforming-mich-lxbte/?trackingId=aqZMBOg4O1KYRB4Una7NEg%3D%3D>
Mich Talebzadeh,
You may find this link of mine on LinkedIn for the said article. We
can use LinkedIn for now.
Leveraging Generative AI with Apache Spark: Transforming Data
Engineering | LinkedIn
Mich Talebzadeh,
g("MDVariables.targetDataset"),
config.getString("MDVariables.targetTable"))
df.unpersist()
// println("wrote to DB")
} else {
println("DataFrame df is empty")
}
}
If the DataFrame is empty, it prints a message indicating that the
DataFrame is empty. You
entertain this idea. They
seem to have a well-defined structure for hosting topics.
Let me know your thoughts
Thanks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
Mich Talebzadeh,
be that the information
(topics) is provided on a best-effort basis and cannot be guaranteed.
Mich Talebzadeh,
- Databricks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
Mich Talebzadeh,
+1 for me
Mich Talebzadeh,
tools like the Spark UI or third-party libraries for this
purpose.
HTH
Mich Talebzadeh,
That should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide.
HTH
Mich Talebzadeh,
such as
pipelining transformations and removing unnecessary computations.
"I may need something like that for synthetic data for testing. Any way to
do that ?"
Have a look at this.
https://github.com/joke2k/faker
HTH
Mich Talebzadeh,
and fraudulent transactions to
build a machine learning model to detect fraudulent transactions using
PySpark's MLlib library. You can install it via pip install Faker.
Details from
https://github.com/joke2k/faker
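A tiny, hedged sketch of what that could look like with Faker (the field names, the 2% fraud rate and the existing SparkSession are assumptions):

import random
from faker import Faker

fake = Faker()
rows = [{
    "customer": fake.name(),
    "card_number": fake.credit_card_number(),
    "amount": round(random.uniform(1.0, 5000.0), 2),
    "event_time": fake.date_time_this_year().isoformat(),
    "is_fraud": random.random() < 0.02,   # ~2% fraud rate, an assumption
} for _ in range(1000)]

df = spark.createDataFrame(rows)   # assumes an existing SparkSession called spark
df.show(5, truncate=False)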
HTH
Mich Talebzadeh,
- You can get additional info from the Spark UI (default port 4040), in tabs like
SQL and Executors.
- Spark uses the Catalyst optimiser for efficient execution plans;
df.explain("extended") shows both the logical and physical plans.
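For instance (the table and column names are placeholders):

# Catalyst will push the filter below the aggregate where possible;
# explain("extended") prints the parsed, analysed and optimised logical plans
# plus the physical plan
df = spark.table("sales").filter("amount > 100").groupBy("region").count()
df.explain("extended")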
HTH
Mich Talebzadeh,
The first line (microbatch_data.get("processedRowsPerSecond")) fails because it uses the
.get() method, while the second line (processedRowsPerSecond =
microbatch_data.processedRowsPerSecond) accesses the attribute directly.
In short, they need to ensure that event.progress returns a
dictionary before calling .get() on it.
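For reference, a hedged sketch of attribute-style access inside a PySpark StreamingQueryListener; the listener class and addListener call follow the standard Spark 3.4+ API, everything else is illustrative.

from pyspark.sql.streaming import StreamingQueryListener

class ProgressListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # event.progress is a StreamingQueryProgress object, not a plain dict,
        # so attribute access is the safe route
        print(f"processedRowsPerSecond = {event.progress.processedRowsPerSecond}")

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ProgressListener())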
Cheers
Mich Talebzadeh,
+--------------------+-------------+-------+--------------------+
|                 key|doubled_value|op_type|             op_time|
+--------------------+-------------+-------+--------------------+
|a960f663-d13a-49c...|            2|      1|2024-03-11 12:17:...|
+--------------------+-------------+-------+--------------------+
I am afraid it is not working. Not even printing anything
Cheers
the output sink (for example, console sink)
query = (
processed_streaming_df.select( \
col("key").alias("key") \
, col("doubled_value").alias("doubled_value") \
, col("op_type"
"spark.sql.warehouse.dir") to print the configured
warehouse directory after creating the SparkSession to confirm all is OK
HTH
Mich Talebzadeh,
ent: Use --py-files if your application
code consists mainly of Python files and doesn't require a separate virtual
environment.
- Separate Virtual Environment: Use --conf spark.yarn.dist.archives if
you manage dependencies in a separate virtual environment archive.
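For illustration, hedged spark-submit fragments for the two options; the file names and paths are placeholders, not taken from the original message.

# Option 1: plain Python files / zips shipped alongside the job
spark-submit \
  --master yarn \
  --py-files deps.zip,helpers.py \
  app.py

# Option 2: a pre-built virtual environment archive distributed to executors
spark-submit \
  --master yarn \
  --conf spark.yarn.dist.archives=pyspark_venv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  app.py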
HTH
Mich Talebzadeh
--conf spark.executor.cores=3 \
--conf spark.driver.memory=1024m \
--conf spark.executor.memory=1024m \
  $CODE_DIRECTORY_CLOUD/${APPLICATION}
HTH
Mich Talebzadeh,
d")
My question is: can these operations be done more efficiently in PySpark
itself, ideally with one DataFrame operation reading the original file (.bz2.zip)?
Thanks
Mich Talebzadeh,
)
# Show the result
joined_df.show()

root
 |-- combined_id: array (nullable = true)
 |    |-- element: long (containsNull = true)
root
 |-- mr_id: long (nullable = true)

+-----------+-----+
|combined_id|mr_id|
+-----------+-----+
|  [1, 2, 3]|    2|
|  [4, 5, 6]|    5|
+-----------+-----+
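A minimal sketch that would produce output like the above (the sample data is made up to match the rows shown; the join condition uses array_contains so each mr_id matches the array that holds it):

from pyspark.sql import functions as F

df1 = spark.createDataFrame([([1, 2, 3],), ([4, 5, 6],)], ["combined_id"])
df2 = spark.createDataFrame([(2,), (5,), (9,)], ["mr_id"])

# keep only the (array, id) pairs where the id is an element of the array
joined_df = df1.join(df2, F.expr("array_contains(combined_id, mr_id)"))

df1.printSchema()
df2.printSchema()
joined_df.show()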
HTH
cause
partial aggregation if a single executor processes most items of a
particular type.
- Partial Aggregations: Spark might be combining partial counts from
executors incorrectly, leading to inaccuracies.
- Finally, a bug in 3.5 is possible.
HTH
Mich Talebzadeh,
Model (Java objects) -->
Spring Application Logic (Controllers, Services, Repositories)
etc. Is this a good guess?
HTH
Mich Talebzadeh,
Hi,
These are all on spark 3.5, correct?
Mich Talebzadeh,
executor overhead. This is important for tasks that require
additional memory beyond the executor memory setting. Example
5. --conf spark.executor.memoryOverhead=1000
HTH
Mich Talebzadeh,
s detailed, but potentially functional, explanation.
- Manual Analysis: analyze the query structure and logical steps
yourself.
- Spark UI: review the Spark UI (accessible through your Spark
application on port 4040) to delve into query execution and potential
bottlenecks.
HTH
Mich Talebzadeh
": "Sleek and modern design, but lacking some features.",
    "user_feedback": "Negative",
    "review_source": "online",
    "sentiment_confidence": 0.33,
    "product_features": "user-friendly",
    "timestamp":
and the host spark version you are submitting your spark-submit
from?
HTH
Mich Talebzadeh,
            # Log the exception
            print(f"Exception in Spark job: {str(e)}")
            # Increment the retry count
            retries += 1
            # Sleep
            time.sleep(60)
        else:
            # Break out of the loop if the job completes successfully
            break
HTH
Mich Talebzadeh,
Ok thanks for your clarifications
Mich Talebzadeh,
and handling within Spark itself
using max_retries = 5 etc?
HTH
Mich Talebzadeh,
Sure, but first it would be beneficial to understand the way Spark works on
Kubernetes and the concepts.
Have a look at this article of mine
Spark on Kubernetes, A Practitioner’s Guide
<https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-%3Ftrackin
OK, so you have a jar file that you want to work with when running Spark
on k8s (EKS) as the execution engine, as opposed to YARN on EMR as the
execution engine?
Mich Talebzadeh,
query-with-dependencies_2.12-0.22.2.jar
"${SPARK_EXTRA_JARS_DIR}"
Here I am accessing Google BigQuery DW from EKS cluster
HTH
Mich Talebzadeh,
true)
Shuffle cleanup successful.
Iteration 5
root
|-- column_to_check: string (nullable = true)
|-- partition_column: long (nullable = true)
Shuffle cleanup successful.
HTH
Mich Talebzadeh,
One brute-force approach is to use shutil.rmtree(path) to remove these files.
HTH
Mich Talebzadeh,
Another brute-force option is that instead of *SaveMode.Append*, you can try using
*SaveMode.Overwrite*. This will overwrite the existing data if it already
exists.
HTH
Mich Talebzadeh,
Hi Chao,
This sounds like a cool feature. A couple of questions:
- Compared to standard Spark, what kind of performance gains can be
expected with Comet?
- Can one use Comet on k8s in conjunction with something like a Volcano
addon?
HTH
Mich Talebzadeh,
Hi, I gather from the replies that the plugin is not currently available in
the form expected, although I am aware of the shell script.
Also, have you got some benchmark results from your tests that you can
possibly share?
Thanks,
Mich Talebzadeh,
verify that the
spark.local.dir property
in your Spark configuration points to a writable directory with enough
space.
- Permission issues: check directories listed in spark.local.dir are
accessible by the Spark user with read/write permissions.
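For example (the path is a placeholder), spark.local.dir can be pointed at a roomy, writable location when building the session; note that on YARN or Kubernetes the cluster manager's own settings may override it.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local_dir_demo")
         .config("spark.local.dir", "/data/spark_scratch")   # placeholder path
         .getOrCreate())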
HTH
Mich Talebzadeh,
gesJson.write.mode("append").parquet(data)
      } catch {
        case ex: Exception =>
          ex.printStackTrace()
      }
    }
  }
This modification uses Option to handle potential null values in the RDD
and filters out any elements that are still "NO records found" after
Reviewing the part of the code where the exception is
thrown, and identifying which object or method call is resulting in *null*, can
help the debugging process, along with checking the logs.
HTH
Mich Talebzadeh,
The full code is available from the link below
https://github.com/michTalebzadeh/Event_Driven_Real_Time_data_processor_with_SSS_and_API_integration
Mich Talebzadeh,
HTH,
Mich Talebzadeh,
only showing top 1 row
rows is 50
HTH
Mich Talebzadeh,
                eckpoint_path)
                .queryName(f"{query_name}")
                .start()
            )
            break  # Exit the loop after starting the streaming query
        else:
            time.sleep(5)  # Sleep for a while before checking for the next event
HTH
Mich Talebzadeh,
        "append")
        .format("console")
        .trigger(processingTime="1 second")
        .option("checkpointLocation", checkpoint_path)
        .start()
)
query.awaitTermination()
In this example, it listens to a socket on localhost and expects a
single integer value per line.
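The reading side described here would look roughly like the sketch below; the port number (9999) is an assumption, as the original message omits it.

lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)   # assumption; the original omits the port
              .load())

# the socket source yields a single string column named "value"
numbers = lines.selectExpr("CAST(value AS INT) AS number")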
dcb-8acfd2e9a61e, runId =
f71cd26f-7e86-491c-bcf2-ab2e3dc63eca] terminated with exception: No usable
value for offset
Did not find value which can be converted into long
Seems like there might be an issue with the *rate-micro-batch* source when
using the *startTimestamp* option.
You can
tate updates, and
available resources. In summary, what works well for one workload might
not be optimal for another.
HTH
Mich Talebzadeh,
ror(f"Error processing table {table_name}: {e}")
else:
print("DataFrame is empty") # Handle empty DataFrame
HTH
Mich Talebzadeh,
Use an intermediate work table to land the streaming JSON data first, and
then, according to the tag, store the data in the correct table.
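A hedged sketch of that idea with foreachBatch; the table, column and checkpoint names are assumptions.

from pyspark.sql import functions as F

def route_batch(batch_df, batch_id):
    # 1) land the raw JSON rows in the intermediate work table
    batch_df.write.mode("append").saveAsTable("staging.events_raw")
    # 2) route each tag's rows to its own target table
    for row in batch_df.select("tag").distinct().collect():
        tag = row["tag"]
        (batch_df.filter(F.col("tag") == tag)
                 .write.mode("append")
                 .saveAsTable(f"curated.events_{tag}"))

query = (parsed_stream.writeStream
                      .foreachBatch(route_batch)
                      .option("checkpointLocation", "/tmp/routing_chk")  # assumption
                      .start())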
HTH
Mich Talebzadeh,
the necessary
state.
HTH
Mich Talebzadeh,
intermediate_df.cache()
# Use cached intermediate_df for further transformations or actions
HTH
Mich Talebzadeh,
Cheers
Mich Talebzadeh,
and performance of your Flask application.
HTH
Mich Talebzadeh,
everyone's benefit. Hopefully your comments will
help me to improve it.
Cheers
Mich Talebzadeh,
ingestion and analytics. My use case revolves around a scenario where
data is generated through REST API requests in real time with PySpark. The
Flask REST API efficiently captures and processes this data, saving it to a
sink of your choice, like a data warehouse or Kafka.
HTH
Mich Talebzadeh,
67108864")
These configurations provide a starting point for tuning RocksDB. Depending
on your specific use case and requirements, of course, your mileage may vary.
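For reference, a hedged sketch of the kind of settings involved (the RocksDB provider needs Spark 3.2+, changelog checkpointing 3.4+; the values are illustrative starting points, not recommendations):

# switch the streaming state store to RocksDB
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")

# example tuning knobs (values are illustrative)
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64")
spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")  # Spark 3.4+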
HTH
Mich Talebzadeh,
How many topics and checkpoint directories are you dealing with?
Does each topic have its own checkpoint on S3?
All these checkpoints are sequential writes, so even SSDs would not really
help.
HTH
Mich Talebzadeh,
The master URL is mandatory even when using the
Spark Operator in Kubernetes deployments.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AppConstants.APPLICATION_NAME") \
    .config("spark.master", "k8s://https://:") \
    .getOrCreate()
Hi,
Do you have more info on this Jira besides the GitHub link, as I don't seem
to find it?
Thanks
Mich Talebzadeh,
Hi Stanislav,
On the PySpark DF, can you run the following
df.printSchema()
and send the output please?
HTH
Mich Talebzadeh,
/49785108/spark-streaming-with-python-how-to-add-a-uuid-column
HTH
Mich Talebzadeh,
Ok so you want to generate some random data and load it into Kafka on a
regular interval and the rest?
HTH
Mich Talebzadeh,
Worth trying the EXPLAIN statement
<https://spark.apache.org/docs/latest/sql-ref-syntax-qry-explain.html>
as suggested by @tianlangstudio.
HTH
Mich Talebzadeh,
Well, not to put too fine a point on it, in a public forum one ought to
respect the importance of open communication. Everyone has the right to ask
questions, seek information, and engage in discussions without facing
unnecessary patronization.
Mich Talebzadeh,
Apologies Koert!
Mich Talebzadeh,
rs to impersonate Identity and Access
Management (IAM) service accounts to access Google Cloud services."
Cheers
Mich Talebzadeh,
and manage
the Spark application's execution on the YARN cluster. HTH
Mich Talebzadeh,