Hi Tim,
I think you can try setting the option *spark.sql.files.ignoreCorruptFiles* to
*true*. With the option enabled, Spark jobs will continue to run when they
encounter corrupted files, and the contents that have been read will still be
returned.
The CSV/JSON data sources also support the PERMISSIVE parse mode.
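For reference, a minimal sketch of both settings; the SparkSession and the
input path here are illustrative, not from this thread:

```
# Hedged sketch: skip corrupt files instead of failing the whole job.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Separately, the JSON/CSV readers tolerate bad *records* via a parse mode.
df = spark.read.option("mode", "PERMISSIVE").json("/data/events")  # hypothetical path
```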
Thanks for the guidance. That was my initial inclination, but I decided that
consistency with the existing ‘hash’ was better. However, like you, I also
prefer the specific form.
I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the
patch (using ‘xxhash64’) at https://g
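For anyone following along, a hedged sketch of how the proposed function could
be called once merged; this assumes it is exposed in pyspark.sql.functions, as
it eventually was in Spark 3.0, and the column names are made up:

```
# Illustrative only: hash two columns with the proposed xxhash64 expression.
from pyspark.sql import functions as F

hashed = df.select(F.xxhash64("id", "payload").alias("hash64"))
```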
Please vote on releasing the following candidate as Apache Spark version 2.4.1.
The vote is open until March 11 PST and passes if a majority of +1 PMC votes
are cast, with a minimum of 3 +1 votes.
[ ] +1 Release this package as Apache Spark 2.4.1
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
/facepalm
Here we go: https://issues.apache.org/jira/browse/SPARK-27093
Tim
Thanks Xiao, it's good to have that validated.
I've created a ticket here: https://issues.apache.org/jira/browse/AVRO-2342
Right now, I'm using the columns-at-a-time mapping:
https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
On Thu, Mar 7, 2019 at 4:00 PM Sean Owen wrote:
> Maybe, it depends on what you're doing. It sounds like you are trying
> to do row-at-a-time mapping, even on a pandas DataFrame.
Maybe, it depends on what you're doing. It sounds like you are trying
to do row-at-a-time mapping, even on a pandas DataFrame. Is what
you're doing vectorized? If not, pandas may not help much.
Just make the pandas Series into a DataFrame if you want, and turn a single
column back into a Series?
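A minimal sketch of that Series/DataFrame round trip in plain pandas; the
values and column name are illustrative:

```
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
frame = s.to_frame(name="value")  # Series -> one-column DataFrame
back = frame["value"]             # single column -> Series again
```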
On Thu, Mar 7, 2019 at 2:45
Hi Spark Devs,
We're processing a large number of Avro files with Spark and found that the
Avro reader is missing the ability to handle malformed or truncated files
the way the JSON reader can. Currently the Avro reader throws an exception
when it encounters any bad or truncated record in an Avro file, causing the
job to fail.
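For comparison, a sketch of the tolerance the JSON reader already offers; the
path is hypothetical and a SparkSession `spark` is assumed. The proposal is
analogous behavior for the Avro reader:

```
# The JSON reader can drop malformed records instead of throwing.
df = (spark.read
      .option("mode", "DROPMALFORMED")  # or "PERMISSIVE"
      .json("/data/raw.json"))          # hypothetical path
```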
It is very similar to SCALAR, but for SCALAR the output can't be a struct/row
and the input has to be a pd.Series, which doesn't support a row.
I'm doing TensorFlow batch inference in Spark:
https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
which I have to do the groupBy in
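A hedged sketch of that groupBy-based workaround as I understand it: tag rows
with their partition id so each partition reaches the UDF as one whole pandas
DataFrame. The function name and its identity body are illustrative stand-ins
for the real inference step:

```
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def infer(pdf):
    # run the TensorFlow model over the whole pandas DataFrame at once
    return pdf

result = df.groupBy(F.spark_partition_id()).apply(infer)
```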
Thanks Ryan and Reynold for the information!
Cheers,
Tyson
From: Ryan Blue
Sent: Wednesday, March 6, 2019 3:47 PM
To: Reynold Xin
Cc: tcon...@gmail.com; Spark Dev List
Subject: Re: Hive Hash in Spark
I think this was needed to add support for bucketed Hive tables. Like Tyson
noted
And in this case, I'm actually benefiting from Arrow's columnar support, so
that I can pass the whole data block to TensorFlow and obtain the block of
predictions all at once.
On Thu, Mar 7, 2019 at 3:45 PM peng yu wrote:
> pandas/Arrow is for memory efficiency, and mapPartitions is only
pandas/Arrow is for memory efficiency, and mapPartitions is only
available on RDDs; sure, I can do everything with RDDs.
But I thought that's the whole point of having pandas_udf: so my program
runs faster and consumes less memory?
On Thu, Mar 7, 2019 at 3:40 PM Sean Owen wrote:
> Are you just applying a function to every row in the DataFrame?
Are you just applying a function to every row in the DataFrame? you
don't need pandas at all. Just get the RDD of Row from it and map a
UDF that makes another Row, and go back to DataFrame. Or make a UDF
that operates on all columns and returns a new value. mapPartitions is
also available if you wa
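A minimal sketch of that plain-RDD route, assuming a DataFrame `df`; the
transformation is an illustrative identity placeholder:

```
# Capture the column order up front so the closure stays serializable.
names = df.schema.fieldNames()

def remap(row):
    d = row.asDict()
    # transform d here; the identity version emits values in schema order
    return tuple(d[n] for n in names)

new_df = df.rdd.map(remap).toDF(df.schema)  # mapPartitions works similarly
```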
Are you looking for SCALAR? That lets you map one row to one row, but
does it more efficiently in batch. What are you trying to do?
On Thu, Mar 7, 2019 at 2:03 PM peng yu wrote:
>
> I'm looking for a mapPartition(pandas_udf) for a pyspark.Dataframe.
>
> ```
> @pandas_udf(df.schema, PandasUDFType.MAP)
I'm looking for a mapPartition(pandas_udf) for a pyspark.Dataframe.
```
@pandas_udf(df.schema, PandasUDFType.MAP)
def do_nothing(pandas_df):
    return pandas_df
new_df = df.mapPartition(do_nothing)
```
pandas_udf only supports SCALAR or GROUPED_MAP. Why not support just MAP?
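For contrast, a SCALAR pandas_udf that Spark 2.4 does support; it maps a
pd.Series to a pd.Series column-at-a-time, which is exactly why it can't
carry a whole row. The function here is an illustrative example:

```
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import LongType

@pandas_udf(LongType(), PandasUDFType.SCALAR)
def plus_one(col):
    # col is a pd.Series holding a batch of values from one column
    return col + 1
```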
On Thu, Mar 7, 2019
Are you looking for @pandas_udf in Python? Or just mapPartition? Those
exist already
On Thu, Mar 7, 2019, 1:43 PM peng yu wrote:
> There is a nice map_partition function in R, `dapply`, so that the user can
> pass a row to a UDF.
>
> I'm wondering why we don't have that in Python?
>
> I'm trying to have a map_partition function with pandas_udf supported.
There is a nice map_partition function in R, `dapply`, so that the user can
pass a row to a UDF.
I'm wondering why we don't have that in Python?
I'm trying to have a map_partition function with pandas_udf supported.
Thanks!
https://issues.apache.org/jira/browse/SPARK-26742
On Thu, Mar 7, 2019 at 10:52 AM shane knapp wrote:
> I'm ready to update the ubuntu workers/minikube/k8s to support the 4.1.2
> client:
> https://issues.apache.org/jira/browse/SPARK-26742
>
> i am more than comfortable with this build system upd
I'm ready to update the ubuntu workers/minikube/k8s to support the 4.1.2
client:
https://issues.apache.org/jira/browse/SPARK-26742
I am more than comfortable with this build system update, both on the ops
and Spark project side. We were incredibly far behind the release cycle
for k8s and minikube,
Saisai,
There is no blocker now. I ran into some difficulties publishing the
jars to Nexus. The publish task finished, but Nexus gave me the
following error.
*failureMessage Failed to validate the pgp signature of
'/org/apache/spark/spark-streaming-flume-assembly_2.11/2.4.1/spark-stream
There is SPARK-26604, which we are looking into.
From: Saisai Shao
Sent: Wednesday, March 6, 2019 6:05 PM
To: shane knapp
Cc: Stavros Kontopoulos; Sean Owen; DB Tsai; Spark dev list; d_t...@apple.com
Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC2)
Do we have other blockers for this release?
Do you know what change fixed it?
If it's not a regression from 2.4.0, it wouldn't necessarily go into a
maintenance release. If there were no downside, maybe; does it cause
any incompatibility with older HBase versions?
It may be that this support is targeted for Spark 3 on purpose, which
is probab
Hello,
I have a question regarding the 2.4.1 release.
It looks like Spark 2.4 (and the 2.4.1 RC) is not exactly compatible with
HBase 2.x+ in YARN mode.
The problem is in the
org.apache.spark.deploy.security.HBaseDelegationTokenProvider class, which
expects a specific version of the TokenUtil class
I'm sure only the first one is listening on the port, but in the master UI,
both applications redirect to the same machine and same port. When I checked
the URLs, they redirect to the application UI of the first submitted one. So
I think it could just be a problem in the UI.
On Wed, Mar 6, 2019, 10:29 PM Sean Owen wrote:
> Two driv