Re: Array indexing functions

2019-02-07 Thread Petar Zečević


Hi,
as far as I know these are not standard functions.

Writing UDFs is easy, but only Java and Scala UDFs are as efficient as built-in 
functions. With Python, data movement and conversion to/from Arrow is still 
necessary, and that makes a difference in performance. That was the motivation 
behind these two functions.

I'd object to a rule of not implementing functions that aren't found anywhere 
else, but there seems to be a consensus around it, so I'll just close the JIRA.

Thanks,
Petar


Sean Owen  writes:

> Is it standard SQL or implemented in Hive? Because UDFs are relatively easy in 
> Spark, we don't need tons of built-ins the way an RDBMS does. 
>
> On Tue, Feb 5, 2019, 7:43 AM Petar Zečević 
>  Hi everybody,
>  I finally created the JIRA ticket and the pull request for the two array 
> indexing functions:
>  https://issues.apache.org/jira/browse/SPARK-26826
>
>  Can any of the committers please check it out?
>
>  Thanks,
>  Petar
>
>  Petar Zečević  writes:
>
>  > Hi,
>  > I implemented two array functions that are useful to us and I wonder if 
> you think it would be useful to add them to the distribution. The functions 
> are used for filtering arrays based on indexes:
>  >
>  > array_allpositions (named after array_position) - takes a column and a 
> value and returns an array of the column's indexes corresponding to elements 
> equal to the provided value
>  >
>  > array_select - takes an array column and an array of indexes and returns a 
> subset of the array based on the provided indexes.
>  >
>  > If you agree with this addition I can create a JIRA ticket and a pull 
> request.
>




-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Array indexing functions

2019-02-05 Thread Petar Zečević


Hi everybody,
I finally created the JIRA ticket and the pull request for the two array 
indexing functions:
https://issues.apache.org/jira/browse/SPARK-26826

Can any of the committers please check it out?

Thanks,
Petar


Petar Zečević  writes:

> Hi,
> I implemented two array functions that are useful to us and I wonder if you 
> think it would be useful to add them to the distribution. The functions are 
> used for filtering arrays based on indexes:
>
> array_allpositions (named after array_position) - takes a column and a value 
> and returns an array of the column's indexes corresponding to elements equal 
> to the provided value
>
> array_select - takes an array column and an array of indexes and returns a 
> subset of the array based on the provided indexes.
>
> If you agree with this addition I can create a JIRA ticket and a pull request.



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Array indexing functions

2018-11-20 Thread Petar Zečević


Hi,
I implemented two array functions that are useful to us and I wonder if you 
think it would be useful to add them to the distribution. The functions are 
used for filtering arrays based on indexes:

array_allpositions (named after array_position) - takes an array column and a 
value, and returns an array of the indexes of the column's elements that are 
equal to the provided value.

array_select - takes an array column and an array of indexes and returns a 
subset of the array based on the provided indexes.
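For concreteness, here is a minimal plain-Python sketch of the semantics I have 
in mind. The 1-based indexing is an assumption chosen to match Spark's existing 
array_position; the actual implementation may differ:

```python
def array_allpositions(arr, value):
    """Return the (1-based, as in Spark's array_position) indexes of
    the elements of arr that are equal to value."""
    return [i for i, elem in enumerate(arr, start=1) if elem == value]

def array_select(arr, indexes):
    """Return the subset of arr at the given (1-based) indexes."""
    return [arr[i - 1] for i in indexes]

# The two compose naturally: selecting by all positions of a value
# yields every occurrence of that value.
arr = ["a", "b", "a", "c"]
positions = array_allpositions(arr, "a")   # [1, 3]
subset = array_select(arr, positions)      # ["a", "a"]
```

The same composition is what we use for filtering: compute the positions of 
interest once, then select by those positions from one or more parallel arrays.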

If you agree with this addition I can create a JIRA ticket and a pull request.

-- 
Petar Zečević

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: code freeze and branch cut for Apache Spark 2.4

2018-08-11 Thread Petar Zečević


Hi, I made some changes to SPARK-24020 
(https://github.com/apache/spark/pull/21109) and implemented spill-over to 
disk. I believe there are no objections to the implementation left and that 
this can now be merged.

Please take a look.

Thanks,

Petar Zečević


Wenchen Fan writes:

> Some updates for the JIRA tickets that we want to resolve before Spark 2.4.
>
> green: merged
> orange: in progress
> red: likely to miss
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> The core functionality is finished, but we still need to add Python API. 
> Tracked by SPARK-24822
>
> SPARK-23899: Built-in SQL Function Improvement
> I think it's ready to go. Although some functions are still in progress, the 
> common ones are all merged.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> It's close, just one last piece. Tracked by SPARK-25029
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> Being reviewed.
>
> SPARK-24882: data source v2 API improvement
> PR is out, being reviewed.
>
> SPARK-24252: Add catalog support in Data Source V2
> Being reviewed.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> It's close, just one last piece: the decimal type support
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue; there is no consensus yet about 
> what the right fix is. Likely to miss Spark 2.4, because it's a long-standing 
> issue, not a regression.
>
> SPARK-24598: Datatype overflow conditions gives incorrect result
> We decided to keep the current behavior in Spark 2.4 and add some 
> documentation (already done). We will reconsider this change in Spark 3.0.
>
> SPARK-24020: Sort-merge join inner range optimization
> There are some discussions about the design, I don't think we can get to a 
> consensus within Spark 2.4.
>
> SPARK-24296: replicating large blocks over 2GB
> Being reviewed.
>
> SPARK-23874: upgrade to Apache Arrow 0.10.0
> Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we 
> should wait a few days.
>
> According to the status, I think we should wait a few more days. Any 
> objections?
>
> Thanks,
> Wenchen
>
> On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:
>
>  ... and we still have a few snags with Scala 2.12 support at 
> https://issues.apache.org/jira/browse/SPARK-25029 
>
>  There is some hope of resolving it on the order of a week, so for the 
> moment, seems worth holding 2.4 for.
>
>  On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>  Hi All,
>
>  I'd like to request a few days extension to the code freeze to complete the 
> upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several 
> key improvements and bug fixes. The RC vote just passed this morning and the 
> code changes are complete in https://github.com/apache/spark/pull/21939. We 
> just need some time for the release artifacts to be available. Thoughts?
>
>  Thanks,
>  Bryan


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Petar Zečević


This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 
(Sort-merge join inner range optimization) but I think it could be useful to 
others too. 

It is finished and ready to be merged (and has been ready for at least a month).

Do you think you could consider including it in 2.4?

Petar


Wenchen Fan writes:

> I went through the open JIRA tickets and here is a list that we should 
> consider for Spark 2.4:
>
> High Priority:
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has a 
> few remaining work items, and I think we should have it in Spark 2.4.
>
> Middle Priority:
> SPARK-23899: Built-in SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there 
> are a few useful higher-order functions in progress, like `array_except`, 
> `transform`, etc. It would be great if we can get them in Spark 2.4.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> Very close to finishing, great to have it in Spark 2.4.
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> This one has been around for years (thanks for your patience, Michael!), and 
> is also close to finishing. Great to have it in 2.4.
>
> SPARK-24882: data source v2 API improvement
> This is to improve the data source v2 API based on what we learned during 
> this release. From the migration of existing sources and design of new 
> features, we found some problems in the API and want to address them. I 
> believe this should be
> the last significant API change to data source v2, so great to have in Spark 
> 2.4. I'll send a discuss email about it later.
>
> SPARK-24252: Add catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently being 
> discussed in the dev list.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to 
> have in 2.4.
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug, great to have in 2.4.
>
> There are some other important features, like adaptive execution and 
> streaming SQL, that are not in the list, since I don't think we can finish 
> them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4 by 
> replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>
>  In theory releases happen on a time-based cadence, so it's pretty much wrap 
> up what's ready by the code freeze and ship it. In practice, the cadence 
> slips frequently, and it's very much a negotiation about what features should 
> push the
>  code freeze out a few weeks every time. So, kind of a hybrid approach here 
> that works OK. 
>
>  Certainly speak up if you think there's something that really needs to get 
> into 2.4. This is that discuss thread.
>
>  (BTW I updated the page you mention just yesterday, to reflect the plan 
> suggested in this thread.)
>
>  On Mon, Jul 30, 2018 at 9:51 AM Tom Graves  
> wrote:
>
>  Shouldn't this be a discuss thread?  
>
>  I'm also happy to see more release managers and agree the time is getting 
> close, but we should see what features are in progress and see how close 
> things are and propose a date based on that. Cutting a branch too soon just 
> creates more work for committers to push to more branches. 
>
>  http://spark.apache.org/versioning-policy.html mentioned the code freeze 
> and release-branch cut in mid-August.
>
>  Tom


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org