Re: Log4j2 upgrade

2022-01-12 Thread Jörn Franke
You cannot simply replace it: log4j 2 has a slightly different API than log4j 1, 
so the Spark source code needs to be changed in a couple of places.

> On 12.01.2022, at 20:53, Amit Sharma wrote:
> 
> 
> Hello, everyone. I am replacing log4j with log4j 2 in my Spark Streaming 
> application. When I deployed my application to the Spark cluster, it gave me 
> the error below:
> 
> " ERROR StatusLogger Log4j2 could not find a logging implementation. Please 
> add log4j-core to the classpath. Using SimpleLogger to log to the console "
> 
> 
> I am including the log4j-core jar in my fat jar, and the core jar is indeed 
> present in the assembled jar. Although the application runs fine, I suspect 
> the logs are being generated by log4j 1, not log4j 2.
> I am building the jar with sbt-assembly, and I also noticed the messages 
> below during the build:
> 
> Fully-qualified classname does not match jar entry:
>   jar entry: META-INF/versions/9/module-info.class
>   class name: module-info.class
> Omitting META-INF/versions/9/module-info.class.
> Fully-qualified classname does not match jar entry:
>   jar entry: META-INF/versions/9/org/apache/logging/log4j/util/Base64Util.class
>   class name: org/apache/logging/log4j/util/Base64Util.class
> Omitting META-INF/versions/9/org/apache/logging/log4j/util/Base64Util.class.
> Fully-qualified classname does not match jar entry:
>   jar entry: META-INF/versions/9/org/apache/logging/log4j/util/internal/DefaultObjectInputFilter.class
> 
> Any idea how to resolve these?
> 
> 
> Thanks
> Amit




Log4j2 upgrade

2022-01-12 Thread Amit Sharma
Hello, everyone. I am replacing log4j with log4j 2 in my Spark Streaming
application. When I deployed my application to the Spark cluster, it gave
me the error below:

" ERROR StatusLogger Log4j2 could not find a logging implementation. Please
add log4j-core to the classpath. Using SimpleLogger to log to the console "


I am including the log4j-core jar in my fat jar, and the core jar is indeed
present in the assembled jar. Although the application runs fine, I suspect
the logs are being generated by log4j 1, not log4j 2.
I am building the jar with sbt-assembly, and I also noticed the messages
below during the build:

Fully-qualified classname does not match jar entry:
  jar entry: META-INF/versions/9/module-info.class
  class name: module-info.class
Omitting META-INF/versions/9/module-info.class.
Fully-qualified classname does not match jar entry:
  jar entry: META-INF/versions/9/org/apache/logging/log4j/util/Base64Util.class
  class name: org/apache/logging/log4j/util/Base64Util.class
Omitting META-INF/versions/9/org/apache/logging/log4j/util/Base64Util.class.
Fully-qualified classname does not match jar entry:
  jar entry: META-INF/versions/9/org/apache/logging/log4j/util/internal/DefaultObjectInputFilter.class


Any idea how to resolve these?


Thanks
Amit


RE: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Crowe, John
I get that, Sean, I really do, but customers being customers, they see Log4j 
and they panic. I've been telling them since this began that version 1.x is 
not affected, but... but...

Letting them know that 2.17.1 is on the way IS helpful, but of course they ask 
us when it is coming. Just trying to reduce the madness.

Regards;
John Crowe
TDi Technologies, Inc.
1600 10th Street Suite B
Plano, TX  75074
(800) 695-1258
supp...@tditechnologies.com

From: Sean Owen 
Sent: Wednesday, January 12, 2022 10:23 AM
To: Crowe, John 
Cc: user@spark.apache.org
Subject: Re: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target 
release day for Spark3.3?

Again: the CVE has no known effect on released Spark versions. Spark 3.3 will 
have log4j 2.x anyway.

On Wed, Jan 12, 2022 at 10:21 AM Crowe, John 
<john.cr...@tditechnologies.com> wrote:
I too would like to know when you anticipate Spark 3.3.0 being released, due to 
the Log4j CVEs.
Our customers are all quite concerned.


Regards;
John Crowe
TDi Technologies, Inc.
1600 10th Street Suite B
Plano, TX  75074
(800) 695-1258
supp...@tditechnologies.com

From: Juan Liu <liuj...@cn.ibm.com>
Sent: Wednesday, January 12, 2022 8:50 AM
To: user@spark.apache.org
Cc: Theodore J Griesenbrock <t...@ibm.com>
Subject: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target 
release day for Spark3.3?

Dear Spark support,

Due to the known log4j security issue, we are required to upgrade log4j to 
version 2.17.1. Currently we use Spark 3.1.2 with the default log4j 1.2.17. We 
also found the logging configuration documentation here:
https://spark.apache.org/docs/3.2.0/configuration.html#configuring-logging

Our questions:

  *   Does Spark 3.1.2 support log4j v2.17.1? How do we upgrade log4j from 1.x 
to 2.17.1 in Spark? Could you please provide guidance?
  *   If Spark 3.1.2 doesn't support log4j v2.17.1, what about Spark 3.2? 
Please advise, thanks!
  *   We found that Spark 3.3 will migrate from log4j 1 to 2 in this ticket: 
https://issues.apache.org/jira/browse/SPARK-37814, and I noticed all sub-tasks 
are done except one. That's awesome! Could you please advise your target 
release date? If it is in the very near future, like January, maybe we can 
wait for 3.3.

BTW, as the log4j issue is a very high-profile security issue, it would be 
better if the Spark team could post the solution directly on the security page 
(https://spark.apache.org/security.html) to benefit end users.

Anyway, thank you so much for providing such a powerful tool, and thanks for 
your patience in reading and replying to this mail. Have a good day!
Juan Liu (刘娟) PMP®
Release Management, Watson Health, China Development Lab
Email: liuj...@cn.ibm.com
Phone: 86-10-82452506



Re: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Sean Owen
Again: the CVE has no known effect on released Spark versions. Spark 3.3
will have log4j 2.x anyway.

On Wed, Jan 12, 2022 at 10:21 AM Crowe, John wrote:

> I too would like to know when you anticipate Spark 3.3.0 being released,
> due to the Log4j CVEs.
>
> Our customers are all quite concerned.
>
> Regards;
> John Crowe
> TDi Technologies, Inc.
> 1600 10th Street Suite B
> Plano, TX  75074
> (800) 695-1258
> supp...@tditechnologies.com
>
> From: Juan Liu
> Sent: Wednesday, January 12, 2022 8:50 AM
> To: user@spark.apache.org
> Cc: Theodore J Griesenbrock
> Subject: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your
> target release day for Spark3.3?
>
>
>
> Dear Spark support,
>
> Due to the known log4j security issue, we are required to upgrade log4j to
> version 2.17.1. Currently we use Spark 3.1.2 with the default log4j 1.2.17.
> We also found the logging configuration documentation here:
> https://spark.apache.org/docs/3.2.0/configuration.html#configuring-logging
>
> Our questions:
>
>    - Does Spark 3.1.2 support log4j v2.17.1? How do we upgrade log4j from
>    1.x to 2.17.1 in Spark? Could you please provide guidance?
>    - If Spark 3.1.2 doesn't support log4j v2.17.1, what about Spark 3.2?
>    Please advise, thanks!
>    - We found that Spark 3.3 will migrate from log4j 1 to 2 in this
>    ticket: https://issues.apache.org/jira/browse/SPARK-37814, and I noticed
>    all sub-tasks are done except one. That's awesome! Could you please
>    advise your target release date? If it is in the very near future, like
>    January, maybe we can wait for 3.3.
>
> BTW, as the log4j issue is a very high-profile security issue, it would be
> better if the Spark team could post the solution directly on the security
> page (https://spark.apache.org/security.html) to benefit end users.
>
> Anyway, thank you so much for providing such a powerful tool, and thanks
> for your patience in reading and replying to this mail. Have a good day!
>
> Juan Liu (刘娟) PMP®
>
> Release Management, Watson Health, China Development Lab
> Email: liuj...@cn.ibm.com
> Phone: 86-10-82452506
>
>
>
>


Re: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Sean Owen
As noted, there is no known effect on Spark: released versions do not use an
affected log4j version and configuration, so there is no remediation
documentation.
It is in any event a good idea to update to 2.x; please see the JIRA ticket
for the log4j 2.x update, which will come in Spark 3.3.0, as this is all
discussed in depth there.
There is no release date for Spark 3.3.0 yet, but it will likely be in a few
months.

On Wed, Jan 12, 2022 at 8:59 AM Juan Liu  wrote:

> Dear Spark support,
>
> Due to the known log4j security issue, we are required to upgrade log4j to
> version 2.17.1. Currently we use Spark 3.1.2 with the default log4j 1.2.17.
> We also found the logging configuration documentation here:
> https://spark.apache.org/docs/3.2.0/configuration.html#configuring-logging
>
> Our questions:
>
>    - Does Spark 3.1.2 support log4j v2.17.1? How do we upgrade log4j from
>    1.x to 2.17.1 in Spark? Could you please provide guidance?
>    - If Spark 3.1.2 doesn't support log4j v2.17.1, what about Spark 3.2?
>    Please advise, thanks!
>    - We found that Spark 3.3 will migrate from log4j 1 to 2 in this
>    ticket: https://issues.apache.org/jira/browse/SPARK-37814, and I noticed
>    all sub-tasks are done except one. That's awesome! Could you please
>    advise your target release date? If it is in the very near future, like
>    January, maybe we can wait for 3.3.
>
> BTW, as the log4j issue is a very high-profile security issue, it would be
> better if the Spark team could post the solution directly on the security
> page (https://spark.apache.org/security.html) to benefit end users.
>
> Anyway, thank you so much for providing such a powerful tool, and thanks
> for your patience in reading and replying to this mail. Have a good day!
>
> Juan Liu (刘娟) PMP®
> Release Management, Watson Health, China Development Lab
> Email: liuj...@cn.ibm.com
> Phone: 86-10-82452506
>
>
>
>


RE: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Crowe, John
I too would like to know when you anticipate Spark 3.3.0 being released, due to 
the Log4j CVEs.
Our customers are all quite concerned.


Regards;
John Crowe
TDi Technologies, Inc.
1600 10th Street Suite B
Plano, TX  75074
(800) 695-1258
supp...@tditechnologies.com

From: Juan Liu 
Sent: Wednesday, January 12, 2022 8:50 AM
To: user@spark.apache.org
Cc: Theodore J Griesenbrock 
Subject: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target 
release day for Spark3.3?

Dear Spark support,

Due to the known log4j security issue, we are required to upgrade log4j to 
version 2.17.1. Currently we use Spark 3.1.2 with the default log4j 1.2.17. We 
also found the logging configuration documentation here:
https://spark.apache.org/docs/3.2.0/configuration.html#configuring-logging

Our questions:

  *   Does Spark 3.1.2 support log4j v2.17.1? How do we upgrade log4j from 1.x 
to 2.17.1 in Spark? Could you please provide guidance?
  *   If Spark 3.1.2 doesn't support log4j v2.17.1, what about Spark 3.2? 
Please advise, thanks!
  *   We found that Spark 3.3 will migrate from log4j 1 to 2 in this ticket: 
https://issues.apache.org/jira/browse/SPARK-37814, and I noticed all sub-tasks 
are done except one. That's awesome! Could you please advise your target 
release date? If it is in the very near future, like January, maybe we can 
wait for 3.3.

BTW, as the log4j issue is a very high-profile security issue, it would be 
better if the Spark team could post the solution directly on the security page 
(https://spark.apache.org/security.html) to benefit end users.

Anyway, thank you so much for providing such a powerful tool, and thanks for 
your patience in reading and replying to this mail. Have a good day!
Juan Liu (刘娟) PMP®
Release Management, Watson Health, China Development Lab
Email: liuj...@cn.ibm.com
Phone: 86-10-82452506




Re: Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Artemis User
There was a discussion on this issue a couple of weeks ago. Basically, if 
you look at the CVE definition for Log4j, the vulnerability only affects 
certain versions of log4j 2.x, not 1.x. Since Spark doesn't use any of the 
affected log4j versions, this shouldn't be a concern.

https://lists.apache.org/list?user@spark.apache.org:lte=1M:Log4j

On 1/12/22 9:50 AM, Juan Liu wrote:

Dear Spark support,

Due to the known log4j security issue, we are required to upgrade log4j to 
version 2.17.1. Currently we use Spark 3.1.2 with the default log4j 1.2.17. 
We also found the logging configuration documentation here: 
https://spark.apache.org/docs/3.2.0/configuration.html#configuring-logging

Our questions:

  * Does Spark 3.1.2 support log4j v2.17.1? How do we upgrade log4j from
    1.x to 2.17.1 in Spark? Could you please provide guidance?
  * If Spark 3.1.2 doesn't support log4j v2.17.1, what about Spark 3.2?
    Please advise, thanks!
  * We found that Spark 3.3 will migrate from log4j 1 to 2 in this ticket:
    https://issues.apache.org/jira/browse/SPARK-37814, and I noticed all
    sub-tasks are done except one. That's awesome! Could you please advise
    your target release date? If it is in the very near future, like
    January, maybe we can wait for 3.3.

BTW, as the log4j issue is a very high-profile security issue, it would be 
better if the Spark team could post the solution directly on the security 
page (https://spark.apache.org/security.html) to benefit end users.

Anyway, thank you so much for providing such a powerful tool, and thanks for 
your patience in reading and replying to this mail. Have a good day!


Juan Liu (刘娟) PMP®
Release Management, Watson Health, China Development Lab
Email: liuj...@cn.ibm.com
Phone: 86-10-82452506


Does Spark 3.1.2/3.2 support log4j 2.17.1+, and how? your target release day for Spark3.3?

2022-01-12 Thread Juan Liu
Dear Spark support,

Due to the known log4j security issue, we are required to upgrade log4j to 
version 2.17.1. Currently we use Spark 3.1.2 with the default log4j 1.2.17. We 
also found the logging configuration documentation here:
https://spark.apache.org/docs/3.2.0/configuration.html#configuring-logging

Our questions:

  *   Does Spark 3.1.2 support log4j v2.17.1? How do we upgrade log4j from 1.x 
to 2.17.1 in Spark? Could you please provide guidance?
  *   If Spark 3.1.2 doesn't support log4j v2.17.1, what about Spark 3.2? 
Please advise, thanks!
  *   We found that Spark 3.3 will migrate from log4j 1 to 2 in this ticket: 
https://issues.apache.org/jira/browse/SPARK-37814, and I noticed all sub-tasks 
are done except one. That's awesome! Could you please advise your target 
release date? If it is in the very near future, like January, maybe we can 
wait for 3.3.

BTW, as the log4j issue is a very high-profile security issue, it would be 
better if the Spark team could post the solution directly on the security page 
(https://spark.apache.org/security.html) to benefit end users.

Anyway, thank you so much for providing such a powerful tool, and thanks for 
your patience in reading and replying to this mail. Have a good day!


Juan Liu (刘娟) PMP®
Release Management, Watson Health, China Development Lab
Email: liuj...@cn.ibm.com
Phone: 86-10-82452506


RE: Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Alana Young
I have updated the gist 
(https://gist.github.com/ally1221/5acddd9650de3dc67f6399a4687893aa). Please 
let me know if there are any additional questions.

Re: pyspark loop optimization

2022-01-12 Thread Ramesh Natarajan
Sorry for the confusion between cume_dist and percent_rank. I was playing 
around with both to see whether the difference in computation mattered, and I 
must have copied percent_rank accidentally. My requirement is to compute 
cume_dist.
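
(For readers following the thread, the two functions do differ. A quick, 
self-contained PySpark illustration; the toy values are invented for the 
example:)

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (2,), (4,)], ["v"])
w = Window.orderBy("v")  # one global window; fine for a toy example

df.select(
    "v",
    F.cume_dist().over(w).alias("cume_dist"),        # (# rows <= v) / n
    F.percent_rank().over(w).alias("percent_rank"),  # (rank - 1) / (n - 1)
).show()
# v=1 -> cume_dist 0.25, percent_rank 0.0
# v=2 -> cume_dist 0.75, percent_rank 0.33...
# v=4 -> cume_dist 1.0,  percent_rank 1.0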

I have a dataframe with a bunch of columns (10+) on which I need to compute 
cume_dist, partitioned by the customer name, site name, and year_month 
columns.

The columns I need to run cume_dist on can contain nulls, and I want to 
exclude those rows from the calculation.

Ideally there would be an option in cume_dist to ignore nulls, e.g. 
F.cume_dist(ignoreNulls=True).over(…), but no such option exists.

I can create a new dataframe df2 from the filtered output of df, run 
cume_dist on it, and finally merge it back into df with a left join.

The question is how to do this efficiently for a list of columns. Writing 
this in a loop and joining at the end of each iteration seems 
counter-intuitive. Is there a better way to do this?
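
(One pattern that avoids both the per-column join and the growing loop is to 
fold the null test into the window's partition key, so null rows land in a 
partition of their own and can be masked out afterwards. A minimal sketch, 
assuming PySpark 3.x; the helper name and the abbreviated two-column list are 
illustrative, not from the original post:)

from pyspark.sql import Window
from pyspark.sql import functions as F

# (source column, output column) pairs; abbreviated for illustration
cols = [
    ("event_duration", "percentile_rank_evt_duration"),
    ("alarm_cnt", "percentile_rank_alarm_cnt"),
]

def null_aware_cume_dist(c):
    # Adding isNull() to the partition key puts null rows in their own
    # partition; the when() below then masks those rows out, so cume_dist
    # is computed only over the non-null rows of each real partition.
    win = (
        Window.partitionBy("p_customer_name", "p_site_name", "year_month",
                           F.col(c).isNull())
        .orderBy(F.col(c))
    )
    return F.when(
        F.col(c).isNotNull(),
        F.round(F.cume_dist().over(win) * 100).cast("integer"),
    )

# A single select keeps the plan flat instead of growing it per iteration.
df = df.select("*", *[null_aware_cume_dist(c).alias(n) for c, n in cols])

(Whether this beats the filter-and-union version depends on the data, so 
comparing the explain() output of both, as suggested below, is worthwhile.)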


> On Jan 11, 2022, at 11:53 PM, Gourav Sengupta wrote:
> 
> 
> Hi,
> 
> I am not sure what you are trying to achieve here; are cume_dist and 
> percent_rank not different?
> 
> If I am able to follow your question correctly, you are looking to filter 
> out NULLs before applying the function to the dataframe, and I think it 
> will be fine if you just create another dataframe first with the non-null 
> values and then apply the function to that dataframe.
> 
> It would help if you could explain what you are trying to achieve here. 
> Applying loops to a dataframe, as you have done, is surely not recommended 
> at all; please see the explain plan of the dataframe in each iteration to 
> understand the effect of your loops. That should give some details.
> 
> 
> Regards,
> Gourav Sengupta
> 
>> On Mon, Jan 10, 2022 at 10:49 PM Ramesh Natarajan  wrote:
>> I want to compute cume_dist on a bunch of columns in a spark dataframe, but 
>> want to remove NULL values before doing so. 
>> 
>> I have this loop in pyspark. While this works, I see the driver runs at 100% 
>> while the executors are idle for the most part. I am reading that running a 
>> loop is an anti-pattern and should be avoided. Any pointers on how to 
>> optimize this section of pyspark code?
>> 
>> I am running this on the AWS Glue 3.0 environment.
>> 
>> from pyspark.sql import Window
>> from pyspark.sql.functions import col, lit
>> import pyspark.sql.functions as F
>>
>> for column_name, new_col in [
>>     ("event_duration", "percentile_rank_evt_duration"),
>>     ("event_duration_pred", "percentile_pred_evt_duration"),
>>     ("alarm_cnt", "percentile_rank_alarm_cnt"),
>>     ("alarm_cnt_pred", "percentile_pred_alarm_cnt"),
>>     ("event_duration_adj", "percentile_rank_evt_duration_adj"),
>>     ("event_duration_adj_pred", "percentile_pred_evt_duration_adj"),
>>     ("encounter_time", "percentile_rank_encounter_time"),
>>     ("encounter_time_pred", "percentile_pred_encounter_time"),
>>     ("encounter_time_adj", "percentile_rank_encounter_time_adj"),
>>     ("encounter_time_adj_pred", "percentile_pred_encounter_time_adj"),
>> ]:
>>     win = (
>>         Window.partitionBy(["p_customer_name", "p_site_name", "year_month"])
>>         .orderBy(col(column_name))
>>     )
>>     # Rank only the non-null rows, then union the null rows back in.
>>     df1 = df.filter(F.col(column_name).isNull())
>>     df2 = df.filter(F.col(column_name).isNotNull()).withColumn(
>>         new_col, F.round(F.cume_dist().over(win) * lit(100)).cast("integer")
>>     )
>>     df = df2.unionByName(df1, allowMissingColumns=True)
>> 
>> For some reason this code runs faster, but it doesn't remove NULLs prior 
>> to computing the distribution. Not sure if this is a proper way to do 
>> this either :(
>> 
>> for column_name, new_col in [
>>     ("event_duration", "percentile_rank_evt_duration"),
>>     ("event_duration_pred", "percentile_pred_evt_duration"),
>>     ("alarm_cnt", "percentile_rank_alarm_cnt"),
>>     ("alarm_cnt_pred", "percentile_pred_alarm_cnt"),
>>     ("event_duration_adj", "percentile_rank_evt_duration_adj"),
>>     ("event_duration_adj_pred", "percentile_pred_evt_duration_adj"),
>>     ("encounter_time", "percentile_rank_encounter_time"),
>>     ("encounter_time_pred", "percentile_pred_encounter_time"),
>>     ("encounter_time_adj", "percentile_rank_encounter_time_adj"),
>>     ("encounter_time_adj_pred", "percentile_pred_encounter_time_adj"),
>> ]:
>>     win = (
>>         Window.partitionBy(["p_customer_name", "p_site_name", "year_month"])
>>         .orderBy(col(column_name))
>>     )
>>     df = df.withColumn(
>>         new_col,
>>         # Nulls are masked in the output, but they still sit in the window,
>>         # so the ranks of non-null rows are not computed "ignoring nulls".
>>         # (Also note this uses percent_rank -- the accidental copy
>>         # mentioned at the top of this message.)
>>         F.when(F.col(column_name).isNull(), F.lit(None)).otherwise(
>>             F.round(F.percent_rank().over(win) * lit(100)).cast("integer")
>>         ),
>>     )
>> 
>> Appreciate if anyone has any pointers on how to go about this..
>> 
>> thanks
>> Ramesh


Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Gourav Sengupta
Hi,

Maybe I am short on time, but can you please add some inline comments to
your code to explain what you are trying to do?

Regards,
Gourav Sengupta



On Tue, Jan 11, 2022 at 5:29 PM Alana Young  wrote:

> I am experimenting with creating and persisting ML pipelines using custom
> transformers (I am using Spark 3.1.2). I was able to create a transformer
> class (for testing purposes, I modeled the code off the SQLTransformer
> class) and save the pipeline model. When I attempt to load the saved
> pipeline model, I am running into the following error:
>
> java.lang.NullPointerException
>   at java.base/java.lang.reflect.Method.invoke(Method.java:559)
>   at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstanceReader(ReadWrite.scala:631)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:276)
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>   at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>   at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>   at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>   at scala.util.Try$.apply(Try.scala:213)
>   at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>   at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>   ... 38 elided
>
>
> Here is a gist containing the relevant code. Any feedback and advice would
> be appreciated. Thank you.
>
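
(One frequent cause of this particular NPE, hedged since the gist code is not 
reproduced here: DefaultParamsReader.loadParamsInstanceReader reflectively 
invokes a static read entry point on the stage class, and if the transformer 
lacks one, in Scala a companion object extending DefaultParamsReadable, in 
PySpark the DefaultParamsReadable/DefaultParamsWritable mixins, the reflective 
invoke can fail exactly like this. A minimal PySpark sketch of a custom 
transformer that survives a save/load round trip; the class and column logic 
are illustrative, not taken from the gist:)

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class UppercaseTransformer(Transformer, HasInputCol, HasOutputCol,
                           DefaultParamsReadable, DefaultParamsWritable):
    # The two mixins at the end are what give the class working
    # save()/load() implementations for pipeline persistence.

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super().__init__()
        self._set(**self._input_kwargs)

    def _transform(self, df):
        # Copy inputCol to outputCol, uppercased.
        return df.withColumn(self.getOutputCol(),
                             F.upper(F.col(self.getInputCol())))

(With the mixins in place, a PipelineModel containing this stage can be saved 
and loaded; note the class must also be importable at load time, since the 
reader resolves it by its fully qualified name.)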