Re: [Announcement] Cloud data lake conference with heavy focus on open source

2020-07-07 Thread Ashley Hoff
Interesting! You've piqued my interest. Will the sessions be available
after the conference? (I'm in the wrong timezone to see this during
daylight hours.)

On Wed, Jul 8, 2020 at 2:40 AM ldazaa11  wrote:

> Hello Sparkers,
>
> If you’re interested in how Spark is being applied in cloud data lake
> environments, then you should check out a new 1-day LIVE, virtual
> conference
> on July 30. This conference is called Subsurface and the focus is technical
> talks tailored specifically for data architects and engineers building
> cloud
> data lakes and related technologies.
>
> Here are some of the speakers presenting at the event:
>
> Wes McKinney - Director at Ursa Labs, Pandas Creator and Apache Arrow
> co-creator.
> Maxime Beauchemin - CEO and Founder, Preset. Apache Superset and Airflow
> Creator.
> Julien Le Dem - Co-founder and CTO at Datakin. Apache Parquet Co-creator.
> Daniel Weeks - Big Data Compute Team Lead, Netflix - Parquet Committer and
> Hive Contributor.
>
>
> You can join here (there’s no cost):
> https://subsurfaceconf.com/summer2020?utm_medium=website&utm_source=open-source&utm_term=na&utm_content=na&utm_campaign=2020-subsurface-summer
>
> The Subsurface Team
> @subsurfaceconf

-- 
Kustoms On Silver 


Re: Mocking pyspark read writes

2020-07-07 Thread Jörn Franke
Write to a local temp directory via file:// ?
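
For example, a minimal sketch of that approach with pytest, assuming the
function under test takes the SparkSession and the input/output paths as
parameters (the module and function names below are made up):

    import pytest
    from pyspark.sql import SparkSession

    from my_job import read_transform_write  # hypothetical function under test

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("test").getOrCreate()

    def test_read_transform_write(spark, tmp_path):
        in_path = f"file://{tmp_path}/in"
        out_path = f"file://{tmp_path}/out"
        # Build a small fixture dataset locally instead of touching HDFS.
        spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).write.parquet(in_path)

        read_transform_write(spark, in_path, out_path)

        assert spark.read.parquet(out_path).count() == 2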

> On 07.07.2020 at 20:07, Dark Crusader wrote:
> 
> 
> Hi everyone,
> 
> I have a function that reads a Parquet file from HDFS and writes one back. When I'm 
> writing a unit test for this function, I want to mock the read and write.
> 
> How do you achieve this? 
> Any help would be appreciated. Thank you.
> 




Mocking pyspark read writes

2020-07-07 Thread Dark Crusader
Hi everyone,

I have a function that reads a Parquet file from HDFS and writes one back. When I'm
writing a unit test for this function, I want to mock the read and write.
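
Roughly, the shape of the function is something like this (a simplified
sketch, not the actual code; the names are made up):

    def read_transform_write(spark, in_path, out_path):
        # Read a Parquet file from HDFS, apply a transformation, write the result back.
        df = spark.read.parquet(in_path)
        df.dropDuplicates().write.mode("overwrite").parquet(out_path)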

How do you achieve this?
Any help would be appreciated. Thank you.


[Announcement] Cloud data lake conference with heavy focus on open source

2020-07-07 Thread ldazaa11
Hello Sparkers, 

If you’re interested in how Spark is being applied in cloud data lake
environments, then you should check out a new 1-day LIVE, virtual conference
on July 30. This conference is called Subsurface and the focus is technical
talks tailored specifically for data architects and engineers building cloud
data lakes and related technologies. 

Here are some of the speakers presenting at the event:

Wes McKinney - Director at Ursa Labs, Pandas Creator and Apache Arrow
co-creator.
Maxime Beauchemin - CEO and Founder, Preset. Apache Superset and Airflow
Creator.
Julien Le Dem - Co-founder and CTO at Datakin. Apache Parquet Co-creator.
Daniel Weeks - Big Data Compute Team Lead, Netflix - Parquet Committer and
Hive Contributor.


You can join here (there’s no cost):
https://subsurfaceconf.com/summer2020?utm_medium=website&utm_source=open-source&utm_term=na&utm_content=na&utm_campaign=2020-subsurface-summer

The Subsurface Team
@subsurfaceconf








Re: When does SparkContext.defaultParallelism have the correct value?

2020-07-07 Thread Sean Owen
If not set explicitly with spark.default.parallelism, it will default
to the number of cores currently available (minimum 2). At the very
start, some executors haven't completed registering, which I think
explains why it goes up after a short time. (In the case of dynamic
allocation it will change over time.) You can set it explicitly to
match what you set the executor count and cores to.
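
For example, a minimal PySpark sketch of setting it explicitly (the executor
and core counts are placeholders; the same properties can also be set from
Java or on spark-submit):

    from pyspark.sql import SparkSession

    executor_count = 10      # placeholder: match your --num-executors
    cores_per_executor = 4   # placeholder: match your --executor-cores

    spark = (
        SparkSession.builder
        .appName("parallelism-example")
        .config("spark.default.parallelism", executor_count * cores_per_executor)
        # Size the S3A connection pool from the same numbers rather than from
        # defaultParallelism read at startup.
        .config("spark.hadoop.fs.s3a.connection.maximum", executor_count * cores_per_executor)
        .getOrCreate()
    )

    print(spark.sparkContext.defaultParallelism)  # reflects the explicit setting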

On Mon, Jul 6, 2020 at 10:35 PM Stephen Coy wrote:
>
> Hi there,
>
> I have found that if I invoke
>
> sparkContext.defaultParallelism()
>
> too early, it will not return the correct value.
>
> For example, if I write this:
>
> final JavaSparkContext sparkContext = new 
> JavaSparkContext(sparkSession.sparkContext());
> final int workerCount = sparkContext.defaultParallelism();
>
> I will get some small number (which I can’t recall right now).
>
> However, if I insert:
>
> sparkContext.parallelize(List.of(1, 2, 3, 4)).collect()
>
> between these two lines, I get the expected value, something like 
> node_count * node_core_count.
>
> This seems like a hacky workaround to me. Is there a better way to 
> get this value initialised properly?
>
> FWIW, I need this value to size a connection pool (fs.s3a.connection.maximum) 
> correctly in a cluster independent way.
>
> Thanks,
>
> Steve C




ANALYZE command not supported on Spark 2.3.2?

2020-07-07 Thread daniel123
Does anyone know if ANALYZE TABLE is supported on Spark 2.3.2? The command
doesn't appear in the documentation
(spark.apache.org/docs/2.3.2/sql-programming-guide.html), although we can
launch it, with strange results: the ANALYZE TABLE job takes hours and
doesn't launch any executors; it just runs in the driver process.
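
For reference, this is roughly how we launch it (a sketch via spark.sql; the
database and table names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().appName("analyze-test").getOrCreate()

    # Table-level statistics (the statement in question).
    spark.sql("ANALYZE TABLE mydb.mytable COMPUTE STATISTICS")

    # The NOSCAN variant collects only the table size (no row count) and skips the full scan.
    spark.sql("ANALYZE TABLE mydb.mytable COMPUTE STATISTICS NOSCAN")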

At the same time, in Spark 3 the command is properly documented in the SQL
Reference (http://spark.apache.org/docs/latest/sql-ref-syntax.html).

Has anyone experienced similar issues? Should we assume that if the ANALYZE
TABLE command is not listed in the SQL Reference, it is not supported?

And a second question: has anyone tried running Hive's ANALYZE TABLE instead?
Could you share your results?

thx









how to disable hivemetastore connection

2020-07-07 Thread iamabug

Hi community,

I am running hundreds of Spark jobs at the same time, which pushes the number
of Hive Metastore connections very high (> 1K). The jobs do not really use the
HMS, so I would like to disable the connection. I have tried setting the
spark.sql.catalogImplementation config to in-memory, which is said to help,
but it turns out it does not. Any suggestion would be appreciated!

Code:

    spark = SparkSession \
        .builder \
        .appName("test") \
        .config("spark.sql.catalogImplementation", "in-memory") \
        .config("spark.executor.memory", "1g") \
        .getOrCreate()

spark-submit command:

    spark2-submit \
        --master yarn \
        --deploy-mode cluster \
        --name "test" \
        --conf spark.sql.catalogImplementation=in-memory \
        test.py

Spark version: 2.2.0
Hadoop version: 2.6.0

xiangzhang1128
xiangzhang1...@gmail.com