
Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-09 Thread Andrew Redd

On Wed, Apr 5, 2023 at 8:06 AM Mich Talebzadeh wrote:

> OK, Spark Structured Streaming.
>
> How are you getting messages into Spark? Is it Kafka?
>
> This suggests to me that the message is incomplete, or that a field in the
> JSON has an unexpected value.
>
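> For example, here is a minimal PySpark sketch (untested; the schema and the
> raw_stream name are hypothetical, assuming a Kafka source) that keeps
> malformed records in a _corrupt_record column so they can be inspected
> instead of failing downstream:
>
> from pyspark.sql import functions as F
> from pyspark.sql.types import StructType, StructField, StringType
>
> # Assumed message schema; the extra column catches unparseable input.
> schema = StructType([
>     StructField("event", StringType()),
>     StructField("payload", StringType()),
>     StructField("_corrupt_record", StringType()),
> ])
>
> parsed = raw_stream.select(
>     F.from_json(F.col("value").cast("string"), schema,
>                 {"mode": "PERMISSIVE",
>                  "columnNameOfCorruptRecord": "_corrupt_record"}).alias("msg"))
>
> # Rows that failed to parse have msg._corrupt_record populated; route them
> # to a dead-letter sink or log them for inspection.
> bad = parsed.filter(F.col("msg._corrupt_record").isNotNull())
>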
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies
> London
> United Kingdom
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
> On Wed, 5 Apr 2023 at 12:58, me  wrote:
>
>> Dear Apache Spark users,
>> I have a long-running Spark application that encounters an
>> ArrayIndexOutOfBoundsException about once every two weeks. The exception
>> does not disrupt the operation of my app, but I'm still concerned about it
>> and would like to find a solution.
>>
>> Here's some additional information about my setup:
>>
>> Spark is running in standalone mode
>> Spark version is 3.3.1
>> Scala version is 2.12.15
>> I'm using Spark Structured Streaming
>>
>> Here's the relevant error message:
>> java.lang.ArrayIndexOutOfBoundsException: Index 59 out of bounds for
>> length 16
>> I've reviewed the code and searched online, but I'm still unable to find
>> a solution. The full stacktrace can be found at this link:
>> https://gist.github.com/rsi2m/ae54eccac93ae602d04d383e56c1a737
>> I would appreciate any insights or suggestions on how to resolve this
>> issue. Thank you in advance for your help.
>>
>> Best regards,
>> rsi2m
>>
>>
>>


Re: Recover RFormula Column Names

2019-10-29 Thread Andrew Redd
Thanks Alessandro!

That did the trick. All of the indices and interactions are in the
metadata. I can also confirm that this solution works in PySpark, since the
metadata is carried over; a sketch follows below.
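
For anyone who lands on this thread later, here's a minimal PySpark sketch
of the same lookup (untested; it assumes the RFormula output DataFrame is
named rform_regression_input, with features column "features", and that
lr_model is the fitted LinearRegression model from my earlier snippet):

meta = rform_regression_input.schema["features"].metadata["ml_attr"]["attrs"]

# "attrs" maps attribute groups ("numeric", "binary", ...) to lists of
# {"idx": ..., "name": ...} entries; flatten them into one lookup table.
idx_to_name = {attr["idx"]: attr["name"]
               for group in meta.values()
               for attr in group}

# Pair each coefficient with the original column it belongs to.
named_coefs = [(idx_to_name[i], c) for i, c in enumerate(lr_model.coefficients)]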

Andrew

On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello Andrew,
> a few years ago I had the same need, and I found this SO answer
> <https://stackoverflow.com/a/36306784/898154> to be the way to go.
>
> Here is an extract of my (Scala) code (which was doing other things on
> top). I have removed the irrelevant parts without re-testing the result, so
> it might not work out of the box, but it should help you get started:
>
> private def getEncodedVectorLookupTable(df: DataFrame,
>                                         featuresColName: String): Map[Long, String] = {
>   val meta = df.select(featuresColName)
>     .schema.fields.head.metadata
>     .getMetadata("ml_attr")
>     .getMetadata("attrs")
>
>   /* REFLECTION START */
>   val field = meta.getClass.getDeclaredField("map")
>   field.setAccessible(true)
>   val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
>   field.setAccessible(false)
>   /* REFLECTION END */
>
>   keys.flatMap(
>     meta.getMetadataArray(_)
>       .map(m => m.getLong("idx") -> m.getString("name"))
>   ).toMap
> }
>
>
> It looks like there is some support now for achieving this, but I have
> never tried it:
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html
>
> Best regards,
> Alessandro
>
> On Mon, 28 Oct 2019 at 21:01, Andrew Redd  wrote:
>
>>
>> Hi All!
>>
>> I'm performing an econometric analysis over several billion rows of data
>> and would like to use the PySpark SparkML implementation of linear
>> regression. In the example below I'm trying to interact hour-of-day and
>> month-of-year indicators. The StringIndexer documentation tells you what
>> it's doing when it one-hot encodes string/factor columns (i.e. dropping
>> the most/least common value, or the first/last when sorted alphabetically),
>> but it doesn't let you recover your coefficient names. This feels like
>> such a general case that I must be missing something. How can I get my
>> column names back post-regression to map them to coefficient values? Do I
>> need to basically rebuild the RFormula logic if this isn't already
>> implemented? I would be happy to use a different Spark language (Scala,
>> Java, etc.) if it's implemented there.
>>
>> Thanks in advance
>>
>> Andrew
>>
>> rform = RFormula(
>>     formula="log_outcome ~ log_treatment + hour_of_day + month_of_year "
>>             "+ hour_of_day:month_of_year + additional_column",
>>     featuresCol="features",
>>     labelCol="label")
>>
>> rform_regression_input = rform.fit(regression_input).transform(regression_input)
>>
>> lr = LinearRegression(featuresCol="features",
>>                       labelCol="label",
>>                       solver="normal")
>>
>> lr_model = lr.fit(rform_regression_input)
>> coefs = [*lr_model.coefficients, lr_model.intercept]
>>
>> return pd.DataFrame(
>>     {"pvalues": lr_model.summary.pValues,
>>      "tvalues": lr_model.summary.tValues,
>>      "std_errs": lr_model.summary.coefficientStandardErrors,
>>      "coefs": coefs})
>>
>>

