Re: Array indexing functions
Hi, as far as I know these are not standard functions. Writing UDFs is easy, but only Java and Scala UDFs are as efficient as built-in functions. With Python, data movement and conversion to and from Arrow is still necessary, and that makes a difference in performance. That was the motivation behind these two.

I'd object to the rule of not implementing functions not found anywhere else, but there seems to be a consensus around this, so I'll just close the JIRA.

Thanks,
Petar

Sean Owen writes:

> Is it standard SQL or implemented in Hive? Because UDFs are relatively
> easy in Spark, we don't need tons of built-ins like an RDBMS does.
>
> On Tue, Feb 5, 2019, 7:43 AM Petar Zečević wrote:
>
> Hi everybody,
> I finally created the JIRA ticket and the pull request for the two array
> indexing functions: https://issues.apache.org/jira/browse/SPARK-26826
>
> Can any of the committers please check it out?
>
> Thanks,
> Petar
>
> Petar Zečević writes:
>
> > Hi,
> > I implemented two array functions that are useful to us and I wonder if
> > you think it would be useful to add them to the distribution. The
> > functions are used for filtering arrays based on indexes:
> >
> > array_allpositions (named after array_position) - takes a column and a
> > value and returns an array of the column's indexes corresponding to
> > elements equal to the provided value
> >
> > array_select - takes an array column and an array of indexes and returns
> > a subset of the array based on the provided indexes.
> >
> > If you agree with this addition I can create a JIRA ticket and a pull
> > request.

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: Array indexing functions
Hi everybody, I finally created the JIRA ticket and the pull request for the two array indexing functions: https://issues.apache.org/jira/browse/SPARK-26826

Can any of the committers please check it out?

Thanks,
Petar

Petar Zečević writes:

> Hi,
> I implemented two array functions that are useful to us and I wonder if you
> think it would be useful to add them to the distribution. The functions are
> used for filtering arrays based on indexes:
>
> array_allpositions (named after array_position) - takes a column and a value
> and returns an array of the column's indexes corresponding to elements equal
> to the provided value
>
> array_select - takes an array column and an array of indexes and returns a
> subset of the array based on the provided indexes.
>
> If you agree with this addition I can create a JIRA ticket and a pull
> request.

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Array indexing functions
Hi, I implemented two array functions that are useful to us and I wonder if you think it would be useful to add them to the distribution. The functions are used for filtering arrays based on indexes:

array_allpositions (named after array_position) - takes a column and a value and returns an array of the column's indexes corresponding to elements equal to the provided value

array_select - takes an array column and an array of indexes and returns a subset of the array based on the provided indexes.

If you agree with this addition I can create a JIRA ticket and a pull request.

--
Petar Zečević

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
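[Editor's note: the semantics described above can be sketched in plain Python. This is a hedged illustration of the proposed behavior, not the actual Catalyst implementation from the PR; 1-based indexes are assumed here, matching the convention of Spark's existing array_position function.]

```python
def array_allpositions(arr, value):
    """Return the 1-based indexes of all elements of arr equal to value
    (1-based to match Spark's array_position convention)."""
    if arr is None:
        return None
    return [i + 1 for i, x in enumerate(arr) if x == value]

def array_select(arr, indexes):
    """Return the subset of arr at the given 1-based indexes,
    skipping any index that falls outside the array."""
    if arr is None or indexes is None:
        return None
    return [arr[i - 1] for i in indexes if 1 <= i <= len(arr)]

# Example: pick out every occurrence of "b"
arr = ["a", "b", "c", "b"]
pos = array_allpositions(arr, "b")  # [2, 4]
sel = array_select(arr, pos)        # ["b", "b"]
```

In PySpark this logic could be registered as a UDF, but as noted elsewhere in the thread, Python UDFs pay a serialization cost that built-in functions avoid, which was the motivation for proposing them as built-ins.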
Re: code freeze and branch cut for Apache Spark 2.4
Hi, I made some changes to SPARK-24020 (https://github.com/apache/spark/pull/21109) and implemented spill-over to disk. I believe there are no objections to the implementation left and that this can now be merged. Please take a look.

Thanks,
Petar Zečević

Wenchen Fan writes:

> Some updates on the JIRA tickets that we want to resolve before Spark 2.4.
>
> green: merged
> orange: in progress
> red: likely to miss
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> The core functionality is finished, but we still need to add the Python API.
> Tracked by SPARK-24822.
>
> SPARK-23899: Built-in SQL Function Improvement
> I think it's ready to go. Although there are still some functions in
> progress, the common ones are all merged.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> It's close, just one last piece. Tracked by SPARK-25029.
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> Being reviewed.
>
> SPARK-24882: data source v2 API improvement
> PR is out, being reviewed.
>
> SPARK-24252: Add catalog support in Data Source V2
> Being reviewed.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> It's close, just one last piece: decimal type support.
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue; there is no consensus yet about
> the right fix. Likely to miss Spark 2.4 because it's a long-standing issue,
> not a regression.
>
> SPARK-24598: Datatype overflow conditions gives incorrect result
> We decided to keep the current behavior in Spark 2.4 and add some
> documentation (already done). We will reconsider this change in Spark 3.0.
>
> SPARK-24020: Sort-merge join inner range optimization
> There are some discussions about the design; I don't think we can get to a
> consensus within Spark 2.4.
>
> SPARK-24296: replicating large blocks over 2GB
> Being reviewed.
>
> SPARK-23874: upgrade to Apache Arrow 0.10.0
> Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we
> should wait a few days.
>
> According to the status, I think we should wait a few more days. Any
> objections?
>
> Thanks,
> Wenchen
>
> On Tue, Aug 7, 2018 at 3:39 AM Sean Owen wrote:
>
> ... and we still have a few snags with Scala 2.12 support at
> https://issues.apache.org/jira/browse/SPARK-25029
>
> There is some hope of resolving it on the order of a week, so for the
> moment, it seems worth holding 2.4 for.
>
> On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler wrote:
>
> Hi All,
> I'd like to request a few days' extension to the code freeze to complete the
> upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several
> key improvements and bug fixes. The RC vote just passed this morning and the
> code changes are complete in https://github.com/apache/spark/pull/21939. We
> just need some time for the release artifacts to become available. Thoughts?
>
> Thanks,
> Bryan

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: code freeze and branch cut for Apache Spark 2.4
This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner range optimization), but I think it could be useful to others too. It is finished and ready to be merged (it was ready at least a month ago).

Do you think you could consider including it in 2.4?

Petar

Wenchen Fan writes:

> I went through the open JIRA tickets and here is a list that we should
> consider for Spark 2.4:
>
> High Priority:
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has a
> few remaining pieces of work and I think we should have it in Spark 2.4.
>
> Middle Priority:
>
> SPARK-23899: Built-in SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there
> are a few useful higher-order functions in progress, like `array_except`,
> `transform`, etc. It would be great if we can get them into Spark 2.4.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> Very close to finishing; great to have it in Spark 2.4.
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> This one has been there for years (thanks for your patience, Michael!), and
> is also close to finishing. Great to have it in 2.4.
>
> SPARK-24882: data source v2 API improvement
> This is to improve the data source v2 API based on what we learned during
> this release. From the migration of existing sources and the design of new
> features, we found some problems in the API and want to address them. I
> believe this should be the last significant API change to data source v2, so
> it would be great to have it in Spark 2.4. I'll send a discuss email about
> it later.
>
> SPARK-24252: Add catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently being
> discussed on the dev list.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to
> have it in 2.4.
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug; great to have the fix in 2.4.
>
> There are some other important features like adaptive execution, streaming
> SQL, etc. that are not in the list, since I think we are not able to finish
> them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4 by
> replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen wrote:
>
> In theory releases happen on a time-based cadence, so it's pretty much "wrap
> up what's ready by the code freeze and ship it". In practice, the cadence
> slips frequently, and it's very much a negotiation about what features
> should push the code freeze out a few weeks every time. So, it's kind of a
> hybrid approach that works OK.
>
> Certainly speak up if you think there's something that really needs to get
> into 2.4. This is that discuss thread.
>
> (BTW I updated the page you mention just yesterday, to reflect the plan
> suggested in this thread.)
>
> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves wrote:
>
> Shouldn't this be a discuss thread?
>
> I'm also happy to see more release managers and agree the time is getting
> close, but we should see what features are in progress, see how close things
> are, and propose a date based on that. Cutting a branch too soon just
> creates more work for committers pushing to more branches.
>
> http://spark.apache.org/versioning-policy.html mentions the code freeze and
> release branch cut in mid-August.
>
> Tom

- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org