Hi Avadhut Narayan Joshi,
The use case is achievable using Spark. Connecting to SQL Server is possible, as Mich mentioned below, as long as there is a JDBC driver that can connect to SQL Server. For production workloads there are some important points to consider:
>> What are the QoS requirements for your case? at least
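Something like this for the JDBC read (untested sketch; the URL, table, credentials and partitioning bounds are placeholders to adjust to your database):

val jdbcDF = sparkSession.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")   // placeholder host/db
  .option("dbtable", "dbo.my_table")                                 // placeholder table
  .option("user", "<user>")
  .option("password", "<password>")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  // optional: split the read across executors instead of one JDBC connection
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()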
How a Spark job reads a data source depends on the underlying source system and on the job configuration: the number of executors and cores per executor.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets
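For example, for an HDFS text source you can hint the minimum split count and check the resulting parallelism (sketch; the path is a placeholder):

val lines = sparkSession.sparkContext.textFile("hdfs:///data/input", minPartitions = 18)
println(s"partitions = ${lines.getNumPartitions}")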
About Shuffle operations.
have you looked at
http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html
and the post mentioned there
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
Also try compressing the output; the sketch below shows the relevant shuffle settings.
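These can be passed when building the session (sketch; the defaults may already be enabled in your Spark version, and the local dir path is a placeholder for a volume with enough space):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("import-job")
  .config("spark.shuffle.compress", "true")             // compress map output written for shuffles
  .config("spark.shuffle.spill.compress", "true")       // compress data spilled to disk during shuffles
  .config("spark.io.compression.codec", "lz4")
  .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // placeholder: where shuffle/spill files land
  .getOrCreate()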
Instead of spark-shell, have you tried running it as a job with spark-submit?
How many executors and cores? Can you share the RDD graph and event timeline from the UI? Did you find which of the tasks are taking more time, and is there any GC pressure?
Please look at the UI if you haven't already; it can provide a lot of information.
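For reference, a spark-submit invocation along these lines (the class name, jar and sizes are placeholders for your job and cluster):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.ImportJob \
  --num-executors 6 \
  --executor-cores 3 \
  --executor-memory 4g \
  import-job.jar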
Thanks for adding the RDD lineage graph.
I could see 18 parallel tasks for the HDFS read; was that changed?
What is the Spark job configuration, i.e. how many executors and cores per executor?
I would say keep the partitioning a multiple of (number of executors * cores per executor) for all the RDDs.
If you have 3 executors, that means a multiple of 3 * (cores per executor) partitions; a sketch follows below.
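Sketch (the executor and core counts are placeholders; plug in your actual values, and someRdd stands in for your RDD):

val executors        = 3                                   // placeholder
val coresPerExecutor = 4                                   // placeholder
val targetPartitions = executors * coresPerExecutor * 2    // a small multiple keeps every core busy

val balanced = someRdd.repartition(targetPartitions)
println(balanced.getNumPartitions)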
It sure looks like it is not able to get sufficient resources from YARN to start the containers.
Is it only this import job, or does any other job you submit also fail to start?
As a test, just try to run another Spark job or a MapReduce job and see if it can be started.
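For example, the bundled SparkPi example is a quick way to check whether YARN can still start containers (the examples jar path depends on your installation):

spark-submit \
  --master yarn \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 2 \
  --executor-memory 1g \
  $SPARK_HOME/examples/jars/spark-examples*.jar 100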
Reduce the Thrift server's resource allocation (number of executors / memory) so YARN has capacity left over for other jobs.
When an HTTP connection is opened, you are opening a connection between one specific machine (with its IP and NIC) and another specific machine, so it can't be serialized and reused on a different machine, right?
This isn't a Spark limitation.
I made a simple diagram, if it helps, of the objects created at the driver versus on the executors.
I am assuming the pullSymbolFromYahoo function opens a connection to the Yahoo API with some token passed in. In the code provided so far, if you have 2000 symbols it will make 2000 new connections and 2000 API calls!
Connection objects can't/shouldn't be serialized and sent to executors; they should be created on the executors themselves, typically once per partition (see the sketch below).
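A sketch of that pattern, assuming a hypothetical createYahooClient(token) helper and your pullSymbolFromYahoo logic (the names and signatures are illustrative, not your actual code):

// symbolsRdd: RDD[String] of ticker symbols
val quotes = symbolsRdd.mapPartitions { symbols =>
  // this closure runs on the executor: one client per partition, not one per symbol
  val client = createYahooClient(token)                  // hypothetical helper, constructed on the executor
  symbols.map(sym => pullSymbolFromYahoo(client, sym))   // reuse the same client for every symbol in the partition
}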
What was the error when you tried to run the MapReduce import job while the Thrift server was running?
Is this the only config that changed? What was the config before?
Also share the Spark Thrift server job config, such as number of executors, cores, memory, etc.
My guess is your MapReduce job is unable to get containers because the Thrift server is holding on to the YARN resources.
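If that is the case, you could cap the Thrift server's footprint when starting it (sketch; the sizes are placeholders, and start-thriftserver.sh accepts the usual spark-submit options):

$SPARK_HOME/sbin/start-thriftserver.sh \
  --master yarn \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 2g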
Apologies for the long answer.
Understanding the partitioning at each stage of the RDD graph/lineage is important for efficient parallelism and a balanced load. This applies to working with any source, streaming or static.
You have a tricky situation here: one Kafka source with 9 partitions.
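To see how the partitioning evolves, you can print the lineage and partition counts at each step (sketch; rawRdd and keyOf are placeholders for your data and key function):

val parsed  = rawRdd.map(r => (keyOf(r), 1))   // partitioning inherited from the source
val grouped = parsed.reduceByKey(_ + _, 18)    // partition count set explicitly at the shuffle

println(parsed.getNumPartitions)
println(grouped.getNumPartitions)
println(grouped.toDebugString)                 // full lineage with partition counts per stage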
Here are my two cents; experts, please correct me if I'm wrong.
It's important to understand why you would choose one over the other and for what kind of use case. There may be a time in the future when the low-level APIs are abstracted away and become legacy, but for now the RDD API is the core, low-level API in Spark, and all higher-level APIs (DataFrames/Datasets) are built on top of it.
Assuming you are talking about Spark Streaming
1) How to analyze what part of code executes on Spark Driver and what part
of code executes on the executors?
RDDs can be understood as a set of data transformations, or a set of jobs. A rough rule: the closures you pass to RDD transformations (map, filter, etc.) run on the executors, while the code that sets up the pipeline and consumes action results runs on the driver; a small illustration follows below. Your understanding deepens as you do more programming with Spark.
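A small illustration (sketch; lines stands in for an existing RDD[String]):

val threshold = 10                  // evaluated on the driver
val count = lines
  .map(_.length)                    // closure runs on the executors
  .filter(_ > threshold)            // closure runs on the executors (threshold is serialized into it)
  .count()                          // action: result is brought back to the driver
println(count)                      // runs on the driver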
Summarizing:
1) The static data set, read from Parquet files in HDFS as a DataFrame, has an initial parallelism of 90 (based on the number of input files).
2) The static DataFrame is converted to an RDD, and that RDD has a parallelism of 18; this was not expected.
dataframe.rdd is lazily evaluated, so there must be some operation in between that changes the partitioning (sketch below).
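You can confirm where the partition count changes (sketch; the path is a placeholder):

val df = sparkSession.read.parquet("hdfs:///data/static")   // placeholder path
println(df.rdd.getNumPartitions)                            // roughly one partition per input file/split here

// after a shuffle (join/groupBy/etc.) the count follows spark.sql.shuffle.partitions
// (default 200) or an explicit coalesce/repartition
println(sparkSession.conf.get("spark.sql.shuffle.partitions"))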
Assuming the MessageHelper.sqlMapping schema is correctly mapped to the input JSON (it would help if the schema and a sample JSON were shared), here is the explode function with DataFrames; similar functionality is available with SQL.
import sparkSession.implicits._
import org.apache.spark.sql.functions._
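Continuing with a sketch that uses made-up column names (adjust to your actual schema; uses the imports above):

// df has a column "events" that is an array of structs (hypothetical layout)
val exploded = df
  .select($"id", explode($"events").as("event"))   // one output row per array element
  .select($"id", $"event.type", $"event.ts")
exploded.show(false)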
You are creating a new streaming context each time:
val streamingContext = new StreamingContext(sparkSession.sparkContext,
Seconds(config.getInt(Constants.Properties.BatchInterval)))
If you want fault tolerance, i.e. to resume from where it stopped between Spark job restarts, the correct way is to restore the StreamingContext from a checkpoint with StreamingContext.getOrCreate, as sketched below.
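Sketch (the checkpoint directory is a placeholder; your DStream setup goes inside the create function):

import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkSession.sparkContext,
    Seconds(config.getInt(Constants.Properties.BatchInterval)))
  ssc.checkpoint("hdfs:///checkpoints/my-job")   // placeholder path
  // ... define your DStreams / transformations here ...
  ssc
}

// reuses the checkpointed context on restart, builds a fresh one on the first run
val streamingContext = StreamingContext.getOrCreate("hdfs:///checkpoints/my-job", createContext _)
streamingContext.start()
streamingContext.awaitTermination()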