[ANNOUNCE] DataFusion Comet regular meetup
*Note: I had previously sent a version of this email to the new DataFusion dev@ mailing list, but I don't think many people have migrated to that yet, so I am sending it to dev@ arrow as well.* Hello, I would like to invite anyone interested to join a regular meetup to discuss the DataFusion Comet project with some of the core contributors. This is a great place for new contributors to ask questions and coordinate on working on issues. The details are in the attached Google document [1], and the first meeting will be tomorrow, Wednesday, May 1st. Thanks, Andy. [1] https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing
Re: [DISCUSS] Drop Java 8 support
Speaking for my own product, we would like to see Java 11 support kept: we rely heavily on Arrow and have Java 11 as our minimum supported version, and we’d like to keep doing that if possible. Our clients are big enterprises with notoriously sluggish update cycles, so we want to offer maximum compatibility. Once security patches are no longer available on the regular public channels there is a compliance issue, so we generally follow the EOL schedule of our dependencies. Corretto, Adoptium and Zulu all have recent public builds of both 8 and 11 and look set to support them with public builds for many years to come. Several organisations I have worked with switched away from Oracle when it made its licensing blunder with Java 8, and although that is rectified now, the change seems to have stuck in quite a few places (at least in my anecdotal experience). A major practical difference to me in Java 17 is the strong encapsulation of internals. Since that affects the majority of serious Java applications, perhaps most people have figured out by now how to add the JVM params that let Java continue working. Still, it could be a consideration if Java 17 is the baseline supported version. Regards, Martin. - In case anyone is curious why we don’t support Java 8 per our own policy, it’s because of the “var” keyword. Seriously, why did Java take so long with that? Even C++ got there sooner! > On 30 Apr 2024, at 16:20, Jacob Wujciak wrote: > > Hello everyone! > Great to see this move forward! > +1 on dropping both 8 and 11 unless there is very good reason to keep 11 > around. > Otherwise people will just move to 11 and then have the pain of migration > again when we drop that (which will happen soon regardless imo). > > On Tue, 30 Apr 2024 at 16:18, Dane Pitkin wrote: > >> Thanks, JB. Are we aware of any downstream dependencies that would benefit >> from maintaining Java 11 support? Apache Spark jumped straight to Java 17. 
>> It seems other projects are dropping both 8 and 11 at the same time as >> mentioned by Fokko. From a maintenance perspective, it would be nice to >> drop both. >> >> On Mon, Apr 29, 2024 at 11:20 AM Jean-Baptiste Onofré >> wrote: >> >>> Hi >>> >>> I think it's time to drop JDK8 support. I would say that we should >>> keep Java11 (jumping directly to Java17 would be problematic >>> potentially for some users I guess). >>> >>> Regards >>> JB >>> >>> On Thu, Apr 25, 2024 at 10:21 PM James Duong >>> wrote: If we dropped JDK 8, we could use the JDK to compile module-info.java >>> files. Then we could remove the custom maven plugin we’re using for >>> compiling module-info.java files for JPMS support and get better IDE >>> integration (as what we’re doing currently somewhat shoe-horns module >>> information alongside JDK8 bytecode). From: Dane Pitkin Date: Thursday, April 25, 2024 at 1:02 PM To: dev@arrow.apache.org Subject: [DISCUSS] Drop Java 8 support Hi all, I would like to revisit the discussion of dropping Java 8 (and maybe >> 11) from Arrow's Java implementation. See GH issue[1] below. This was also discussed in the last Arrow community sync meeting on 2024-04-24. For context, this was discussed[2] last year on this mailing list. We decided to revisit the discussion around the June 2024 release (Arrow >>> v17). The timing coincides with the initial release of Apache Spark 4.0.0, >>> which drops both Java 8 and 11 support. For background, we chose not to drop Java 8 support last year because >>> Arrow is seen as a low level library that should support as many environments >>> as possible. Nowadays, we see more enthusiasm for dropping Java 8 (and 11) >>> as exemplified by Apache Spark as well as Apache Iceberg[3]. Is it time to consider dropping Java 8? Should we drop Java 11 and skip straight to Java 17 as our minimum version? What implications do we >> need >>> to be aware of? 
Thanks, Dane [1]https://github.com/apache/arrow/issues/38051 [2]https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8 [3]https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368 >>> >>
Re: [VOTE][Format] UUID canonical extension type
+1 (binding) On 19/04/2024 at 22:22, Rok Mihevc wrote: Hi all, Following initial requests [1][2] and recent tangential ML discussion [3] I would like to propose a vote to add language for UUID canonical extension type to CanonicalExtensions.rst as in PR [4] and written below. A draft C++ and Python implementation PR can be seen here [5]. [1] https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j [2] https://github.com/apache/arrow/issues/15058 [3] https://lists.apache.org/thread/8d5ldl5cb7mms21rd15lhpfrv4j9no4n [4] https://github.com/apache/arrow/pull/41299 <- proposed change [5] https://github.com/apache/arrow/pull/37298 The vote will be open for at least 72 hours. [ ] +1 Accept this proposal [ ] +0 [ ] -1 Do not accept this proposal because... UUID * Extension name: `arrow.uuid`. * The storage type of the extension is ``FixedSizeBinary`` with a length of 16 bytes. .. note:: A specific UUID version is not required or guaranteed. This extension represents UUIDs as FixedSizeBinary(16) and does not interpret the bytes in any way. Rok
Re: [VOTE][Format] JSON canonical extension type
+1 (binding) for the current proposal, i.e. with the RFC 8259 requirement and the 3 current String types allowed. Regards Antoine. On 30/04/2024 at 19:26, Rok Mihevc wrote: Hi all, thanks for the votes and comments so far. I've amended [1] the proposed language with the RFC-8259 requirement as it seems to be almost unanimously requested. New language is below. To Micah's comment regarding rejecting Binary arrays [2] - please discuss in the PR. Let's leave the vote open until after the May holiday. Rok [1] https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e [2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040 JSON * Extension name: `arrow.json`. * The storage type of this extension is ``StringArray`` or ``LargeStringArray`` or ``StringViewArray``. *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.* * Extension type parameters: This type does not have any parameters. * Description of the serialization: Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.
Re: [VOTE][Format] JSON canonical extension type
+1 (non-binding) Thanks for moving these two forward, Rok! On Tue, 30 Apr 2024 at 19:26, Rok Mihevc wrote: > Hi all, thanks for the votes and comments so far. > I've amended [1] the proposed language with the RFC-8259 requirement as it > seems to be almost unanimously requested. New language is below. > To Micah's comment regarding rejecting Binary arrays [2] - please discuss > in the PR. > > Let's leave the vote open until after the May holiday. > > Rok > > [1] > https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e > [2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040 > > > JSON > > > * Extension name: `arrow.json`. > > * The storage type of this extension is ``StringArray`` > or ``LargeStringArray`` or ``StringViewArray``. > *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.* > > * Extension type parameters: > > This type does not have any parameters. > > * Description of the serialization: > > Metadata is either an empty string or a JSON string with an empty object. > In the future, additional fields may be added, but they are not required > to interpret the array. >
Re: [VOTE][Format] UUID canonical extension type
+1 (binding) On Tue, 30 Apr 2024 at 19:52, Jacob Wujciak wrote: > +1 (non-binding) > > On Tue, 30 Apr 2024 at 17:48, Weston Pace <weston.p...@gmail.com> wrote: > > > +1 (binding) > > > > On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc wrote: > > > > > Thanks for all the reviews and comments! I've included the big-endian > > > requirement so the proposed language is now as below. > > > I'll leave the vote open until after the May holiday. > > > > > > Rok > > > > > > UUID > > > > > > * Extension name: `arrow.uuid`. > > > > > > * The storage type of the extension is ``FixedSizeBinary`` with a length of > > > 16 bytes. > > > > > > .. note:: > > > A specific UUID version is not required or guaranteed. This extension represents > > > UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not > > > interpret the bytes in any way.
Re: [VOTE][Format] UUID canonical extension type
+1 (non-binding) On Tue, 30 Apr 2024 at 17:48, Weston Pace <weston.p...@gmail.com> wrote: > +1 (binding) > > On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc wrote: > > > Thanks for all the reviews and comments! I've included the big-endian > > requirement so the proposed language is now as below. > > I'll leave the vote open until after the May holiday. > > > > Rok > > > > UUID > > > > * Extension name: `arrow.uuid`. > > > > * The storage type of the extension is ``FixedSizeBinary`` with a length of > > 16 bytes. > > > > .. note:: > > A specific UUID version is not required or guaranteed. This extension represents > > UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not > > interpret the bytes in any way.
Re: [VOTE][Format] JSON canonical extension type
Hi all, thanks for the votes and comments so far. I've amended [1] the proposed language with the RFC-8259 requirement as it seems to be almost unanimously requested. New language is below. To Micah's comment regarding rejecting Binary arrays [2] - please discuss in the PR. Let's leave the vote open until after the May holiday. Rok [1] https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e [2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040 JSON * Extension name: `arrow.json`. * The storage type of this extension is ``StringArray`` or ``LargeStringArray`` or ``StringViewArray``. *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.* * Extension type parameters: This type does not have any parameters. * Description of the serialization: Metadata is either an empty string or a JSON string with an empty object. In the future, additional fields may be added, but they are not required to interpret the array.
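The two rules in the proposed language (values are UTF-8 encoded RFC 8259 JSON; extension metadata is an empty string or a JSON object) could be checked roughly as follows. This is only a sketch using Python's standard `json` module, not any Arrow implementation's API, and the helper names are hypothetical. It assumes that any JSON object is acceptable as metadata, since the wording allows for fields being added later. Python's parser is also not a full RFC 8259 conformance checker; the one known leniency handled here is that it accepts `NaN` and `Infinity` by default.

```python
import json


def _reject_constant(token: str):
    # Python's json module accepts NaN/Infinity by default; RFC 8259 does not.
    raise ValueError(f"{token} is not valid RFC 8259 JSON")


def is_rfc8259_json(raw: bytes) -> bool:
    """True if raw is UTF-8 text that parses as a single JSON value."""
    try:
        text = raw.decode("utf-8")
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        return False
    return True


def is_valid_extension_metadata(serialized: str) -> bool:
    """True for an empty string or a JSON object (assumed reading of the proposal)."""
    if serialized == "":
        return True
    try:
        return isinstance(json.loads(serialized), dict)
    except ValueError:
        return False
```

A reader of an `arrow.json` column might apply `is_rfc8259_json` to each stored value and `is_valid_extension_metadata` to the serialized extension metadata before trusting the annotation.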
Re: [Discuss] Extension types based on canonical extension types?
I don't think there is any current barrier to using implementation features of one extension type to help with another. In Python, for example, one might be able to do:

    import pyarrow as pa

    # pa.JSONExtensionType and some_action() are hypothetical placeholders.
    class GeoJSONExtensionType(pa.ExtensionType):
        def __init__(self):
            self._json_ext = pa.JSONExtensionType()

        def some_action(self):
            return self._json_ext.some_action()

One could do something similar with the Array/Scalar classes. I am not sure there is anything "automatic" that any current implementation would be able to offer even if this information were machine parseable. The only thing I can think of is that implementations like Arrow C++ that aggressively drop extension information might be able to drop the extension type by assigning a different one; however, I am not sure that it would be useful enough to ever be implemented.

-dewey

On Tue, Apr 30, 2024 at 1:31 PM Ian Cook wrote: > > But consider that a user might want to define a > non-canonical HLLSKETCH extension type and make use of Arrow > implementations' features for handling JSON canonical extension type > columns in order to handle HLLSKETCH extension type columns. The spec > currently does not provide any means to enable this. I wonder if we should > consider incorporating something like this into the spec. > > For example, maybe the colon character could have the special meaning > "represented as" in extension type names, so that implementations would > recognize "hllsketch:arrow.json" as meaning: a column with extension type > hllsketch, which is represented as in the JSON canonical extension type. > > Ian > > On Tue, Apr 30, 2024 at 11:51 AM Weston Pace wrote: > > > I think "inheritance" and "composition" are more concerns for > > implementations than they are for spec (I could be wrong here). 
> > > > So it seems that it would be sufficient to write the HLLSKETCH's canonical > > definition as "this is an extension of the JSON logical type and supports > > all the same storage types" and then allow implementations to use whatever > > inheritance / composition scheme they want to behind the scenes. > > > > On Tue, Apr 30, 2024 at 7:47 AM Matt Topol wrote: > > > > > I think the biggest blocker to doing this is the way that we pass > > extension > > > types through IPC. Extension types are sent as their underlying storage > > > type with metadata key-value pairs of specific keys > > "ARROW:extension:name" > > > and "ARROW:extension:metadata". Since you can't have multiple values for > > > the same key in the metadata, this would prevent the ability to define an > > > extension type in terms of another extension type as you wouldn't be able > > > to include the metadata for the second-level extension part. > > > > > > i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you > > > wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its > > > storage type. So the storage type needs to be a valid core Arrow data > > type > > > for this reason. > > > > > > On Tue, Apr 30, 2024 at 10:16 AM Ian Cook wrote: > > > > > > > The vote on adding a JSON canonical extension type [1] got me > > wondering: > > > Is > > > > it possible to define an extension type that is based on a canonical > > > > extension type? If so, how? > > > > > > > > For example, say I wanted to define a (non-canonical) HLLSKETCH > > extension > > > > type that corresponds to the type that Redshift uses for HyperLogLog > > > > sketches and is represented as JSON [2]. Is there a way to do this by > > > > building on the JSON canonical extension type? > > > > > > > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq > > > > [2] > > https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html > > > > > > > > Ian > > > > > > > > >
Re: [Discuss] Extension types based on canonical extension types?
But consider that a user might want to define a non-canonical HLLSKETCH extension type and make use of Arrow implementations' features for handling JSON canonical extension type columns in order to handle HLLSKETCH extension type columns. The spec currently does not provide any means to enable this. I wonder if we should consider incorporating something like this into the spec. For example, maybe the colon character could have the special meaning "represented as" in extension type names, so that implementations would recognize "hllsketch:arrow.json" as meaning: a column with extension type hllsketch, which is represented as in the JSON canonical extension type. Ian On Tue, Apr 30, 2024 at 11:51 AM Weston Pace wrote: > I think "inheritance" and "composition" are more concerns for > implementations than they are for spec (I could be wrong here). > > So it seems that it would be sufficient to write the HLLSKETCH's canonical > definition as "this is an extension of the JSON logical type and supports > all the same storage types" and then allow implementations to use whatever > inheritance / composition scheme they want to behind the scenes. > > On Tue, Apr 30, 2024 at 7:47 AM Matt Topol wrote: > > > I think the biggest blocker to doing this is the way that we pass > extension > > types through IPC. Extension types are sent as their underlying storage > > type with metadata key-value pairs of specific keys > "ARROW:extension:name" > > and "ARROW:extension:metadata". Since you can't have multiple values for > > the same key in the metadata, this would prevent the ability to define an > > extension type in terms of another extension type as you wouldn't be able > > to include the metadata for the second-level extension part. > > > > i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you > > wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its > > storage type. 
So the storage type needs to be a valid core Arrow data > type > > for this reason. > > > > On Tue, Apr 30, 2024 at 10:16 AM Ian Cook wrote: > > > > > The vote on adding a JSON canonical extension type [1] got me > wondering: > > Is > > > it possible to define an extension type that is based on a canonical > > > extension type? If so, how? > > > > > > For example, say I wanted to define a (non-canonical) HLLSKETCH > extension > > > type that corresponds to the type that Redshift uses for HyperLogLog > > > sketches and is represented as JSON [2]. Is there a way to do this by > > > building on the JSON canonical extension type? > > > > > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq > > > [2] > https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html > > > > > > Ian > > > > > >
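To make the colon idea floated in this message concrete, here is a minimal sketch of how an implementation might split such a name. Nothing in the Arrow spec defines this convention; the function, the class, and the choice to split at the first colon are all assumptions for illustration.

```python
from typing import NamedTuple, Optional


class ParsedExtensionName(NamedTuple):
    name: str                      # the extension type itself, e.g. "hllsketch"
    represented_as: Optional[str]  # the extension type it builds on, if any


def parse_extension_name(raw: str) -> ParsedExtensionName:
    """Split 'outer:base' at the first colon; names without a colon have no base."""
    name, sep, base = raw.partition(":")
    return ParsedExtensionName(name, base if sep else None)
```

Under this reading, `parse_extension_name("hllsketch:arrow.json")` would report `hllsketch` as the type and `arrow.json` as its representation, while a plain name such as `arrow.uuid` would have no base type.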
Re: [Discuss] Extension types based on canonical extension types?
I think "inheritance" and "composition" are more concerns for implementations than they are for spec (I could be wrong here). So it seems that it would be sufficient to write the HLLSKETCH's canonical definition as "this is an extension of the JSON logical type and supports all the same storage types" and then allow implementations to use whatever inheritance / composition scheme they want to behind the scenes. On Tue, Apr 30, 2024 at 7:47 AM Matt Topol wrote: > I think the biggest blocker to doing this is the way that we pass extension > types through IPC. Extension types are sent as their underlying storage > type with metadata key-value pairs of specific keys "ARROW:extension:name" > and "ARROW:extension:metadata". Since you can't have multiple values for > the same key in the metadata, this would prevent the ability to define an > extension type in terms of another extension type as you wouldn't be able > to include the metadata for the second-level extension part. > > i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you > wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its > storage type. So the storage type needs to be a valid core Arrow data type > for this reason. > > On Tue, Apr 30, 2024 at 10:16 AM Ian Cook wrote: > > > The vote on adding a JSON canonical extension type [1] got me wondering: > Is > > it possible to define an extension type that is based on a canonical > > extension type? If so, how? > > > > For example, say I wanted to define a (non-canonical) HLLSKETCH extension > > type that corresponds to the type that Redshift uses for HyperLogLog > > sketches and is represented as JSON [2]. Is there a way to do this by > > building on the JSON canonical extension type? > > > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq > > [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html > > > > Ian > > >
Re: [VOTE][Format] UUID canonical extension type
+1 (binding) On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc wrote: > Thanks for all the reviews and comments! I've included the big-endian > requirement so the proposed language is now as below. > I'll leave the vote open until after the May holiday. > > Rok > > UUID > > > * Extension name: `arrow.uuid`. > > * The storage type of the extension is ``FixedSizeBinary`` with a length of > 16 bytes. > > .. note:: >A specific UUID version is not required or guaranteed. This extension > represents >UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not > interpret the bytes in any way. >
Re: [VOTE][Format] JSON canonical extension type
+1 (binding) I agree we should be explicit about RFC-8259. On Mon, Apr 29, 2024 at 4:46 PM David Li wrote: > +1 (binding) > > assuming we explicitly state RFC-8259 > > On Tue, Apr 30, 2024, at 08:02, Matt Topol wrote: > > +1 (binding) > > > > On Mon, Apr 29, 2024 at 5:36 PM Ian Cook wrote: > > > >> +1 (non-binding) > >> > >> I added a comment in the PR suggesting that we explicitly refer to > RFC-8259 > >> in CanonicalExtensions.rst. > >> > >> On Mon, Apr 29, 2024 at 1:21 PM Micah Kornfield > >> wrote: > >> > >> > +1, I added a comment to the PR because I think we should recommend > >> > implementations specifically reject parsing Binary arrays with the > >> > annotation in case we want to support non-UTF8 encodings in the future > >> > (even though IIRC these aren't really JSON spec compliant). > >> > > >> > On Fri, Apr 19, 2024 at 1:24 PM Rok Mihevc > wrote: > >> > > >> > > Hi all, > >> > > > >> > > Following discussions [1][2] and preliminary implementation work (by > >> > > Pradeep Gollakota) [3] I would like to propose a vote to add > language > >> for > >> > > JSON canonical extension type to CanonicalExtensions.rst as in PR > [4] > >> and > >> > > written below. > >> > > A draft C++ implementation PR can be seen here [3]. > >> > > > >> > > [1] > https://lists.apache.org/thread/p3353oz6lk846pnoq6vk638tjqz2hm1j > >> > > [2] > https://lists.apache.org/thread/7xph3476g9rhl9mtqvn804fqf5z8yoo1 > >> > > [3] https://github.com/apache/arrow/pull/13901 > >> > > [4] https://github.com/apache/arrow/pull/41257 <- proposed change > >> > > > >> > > > >> > > The vote will be open for at least 72 hours. > >> > > > >> > > [ ] +1 Accept this proposal > >> > > [ ] +0 > >> > > [ ] -1 Do not accept this proposal because... > >> > > > >> > > > >> > > JSON > >> > > > >> > > > >> > > * Extension name: `arrow.json`. > >> > > > >> > > * The storage type of this extension is ``StringArray`` > >> > > or ``LargeStringArray`` or ``StringViewArray``. 
> >> > > Only UTF-8 encoded JSON is supported. > >> > > > >> > > * Extension type parameters: > >> > > > >> > > This type does not have any parameters. > >> > > > >> > > * Description of the serialization: > >> > > > >> > > Metadata is either an empty string or a JSON string with an empty > >> > object. > >> > > In the future, additional fields may be added, but they are not > >> > required > >> > > to interpret the array. > >> > > > >> > > > >> > > > >> > > Rok > >> > > > >> > > >> >
Re: [DISCUSS] Drop Java 8 support
Hello everyone! Great to see this move forward! +1 on dropping both 8 and 11 unless there is very good reason to keep 11 around. Otherwise people will just move to 11 and then have the pain of migration again when we drop that (which will happen soon regardless imo). On Tue, 30 Apr 2024 at 16:18, Dane Pitkin wrote: > Thanks, JB. Are we aware of any downstream dependencies that would benefit > from maintaining Java 11 support? Apache Spark jumped straight to Java 17. > It seems other projects are dropping both 8 and 11 at the same time as > mentioned by Fokko. From a maintenance perspective, it would be nice to > drop both. > > On Mon, Apr 29, 2024 at 11:20 AM Jean-Baptiste Onofré > wrote: > > > Hi > > > > I think it's time to drop JDK8 support. I would say that we should > > keep Java11 (jumping directly to Java17 would be problematic > > potentially for some users I guess). > > > > Regards > > JB > > > > On Thu, Apr 25, 2024 at 10:21 PM James Duong > > wrote: > > > > > > If we dropped JDK 8, we could use the JDK to compile module-info.java > > files. Then we could remove the custom maven plugin we’re using for > > compiling module-info.java files for JPMS support and get better IDE > > integration (as what we’re doing currently somewhat shoe-horns module > > information alongside JDK8 bytecode). > > > > > > From: Dane Pitkin > > > Date: Thursday, April 25, 2024 at 1:02 PM > > > To: dev@arrow.apache.org > > > Subject: [DISCUSS] Drop Java 8 support > > > Hi all, > > > > > > I would like to revisit the discussion of dropping Java 8 (and maybe > 11) > > > from Arrow's Java implementation. See GH issue[1] below. This was also > > > discussed in the last Arrow community sync meeting on 2024-04-24. > > > > > > For context, this was discussed[2] last year on this mailing list. We > > > decided to revisit the discussion around the June 2024 release (Arrow > > v17). 
> > > The timing coincides with the initial release of Apache Spark 4.0.0, > > which > > > drops both Java 8 and 11 support. > > > > > > For background, we chose not to drop Java 8 support last year because > > Arrow > > > is seen as a low level library that should support as many environments > > as > > > possible. Nowadays, we see more enthusiasm for dropping Java 8 (and 11) > > as > > > exemplified by Apache Spark as well as Apache Iceberg[3]. > > > > > > Is it time to consider dropping Java 8? Should we drop Java 11 and skip > > > straight to Java 17 as our minimum version? What implications do we > need > > to > > > be aware of? > > > > > > Thanks, > > > Dane > > > > > > [1]https://github.com/apache/arrow/issues/38051 > > > [2]https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8 > > > [3]https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368 > > >
[CROWDSOURCING] May 2024 ASF Board Report
As part of being a new project, we need to submit reports to the board every month for the first three months[1]. In the tradition of Apache Arrow, I hope the community can help draft this report. Please take a look and add anything that might be relevant[2]. Thanks, Andrew [1]: https://github.com/apache/datafusion/issues/10281 [2]: https://docs.google.com/document/d/1knyR2epIOY7WoXZO_DOtlcPNSenb3-V-osCHqPXqSms/edit
Re: [VOTE][Format] UUID canonical extension type
Thanks for all the reviews and comments! I've included the big-endian requirement so the proposed language is now as below. I'll leave the vote open until after the May holiday. Rok UUID * Extension name: `arrow.uuid`. * The storage type of the extension is ``FixedSizeBinary`` with a length of 16 bytes. .. note:: A specific UUID version is not required or guaranteed. This extension represents UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not interpret the bytes in any way.
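As a rough illustration of the storage layout described in this message, the sketch below uses only Python's standard `uuid` module rather than any Arrow implementation's API. `uuid.UUID.bytes` yields the 16 bytes in big-endian order, which matches the amended wording, whereas `bytes_le` stores the first three fields little-endian and would not. The helper names are hypothetical.

```python
import uuid

UUID_BYTE_WIDTH = 16  # FixedSizeBinary(16) storage width from the proposal


def uuid_to_storage(value: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian form that would go into the storage array."""
    return value.bytes


def storage_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 big-endian bytes; the UUID version is not checked."""
    if len(raw) != UUID_BYTE_WIDTH:
        raise ValueError(f"expected {UUID_BYTE_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
stored = uuid_to_storage(example)
```

Here `stored.hex()` reads in the same order as the textual UUID with the hyphens removed, which is a quick way to confirm that the big-endian form is in use.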
Re: [Discuss] Extension types based on canonical extension types?
I think the biggest blocker to doing this is the way that we pass extension types through IPC. Extension types are sent as their underlying storage type with metadata key-value pairs of specific keys "ARROW:extension:name" and "ARROW:extension:metadata". Since you can't have multiple values for the same key in the metadata, this would prevent the ability to define an extension type in terms of another extension type as you wouldn't be able to include the metadata for the second-level extension part. i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its storage type. So the storage type needs to be a valid core Arrow data type for this reason. On Tue, Apr 30, 2024 at 10:16 AM Ian Cook wrote: > The vote on adding a JSON canonical extension type [1] got me wondering: Is > it possible to define an extension type that is based on a canonical > extension type? If so, how? > > For example, say I wanted to define a (non-canonical) HLLSKETCH extension > type that corresponds to the type that Redshift uses for HyperLogLog > sketches and is represented as JSON [2]. Is there a way to do this by > building on the JSON canonical extension type? > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq > [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html > > Ian >
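The constraint described in this message can be sketched with a plain Python dict standing in for a field's key-value metadata. That stand-in is an assumption for illustration, since real implementations have their own metadata containers. The two key names come from the message; the function name and the payload strings are hypothetical.

```python
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"


def annotate_field(field_metadata: dict, ext_name: str, ext_metadata: str) -> dict:
    """Return a copy of the field metadata tagged with one extension type."""
    tagged = dict(field_metadata)
    tagged[EXT_NAME_KEY] = ext_name
    tagged[EXT_METADATA_KEY] = ext_metadata
    return tagged


# Tag a string-typed field as JSON, then try to layer HLLSKETCH on top of it.
as_json = annotate_field({}, "arrow.json", "{}")
as_hll = annotate_field(as_json, "HLLSKETCH", "")
```

After the second call only the HLLSKETCH name survives under `ARROW:extension:name`: the JSON annotation has been overwritten, which is why the storage type of an extension has to be a core Arrow type under the current scheme.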
Re: [DISCUSS] Drop Java 8 support
Thanks, JB. Are we aware of any downstream dependencies that would benefit from maintaining Java 11 support? Apache Spark jumped straight to Java 17. It seems other projects are dropping both 8 and 11 at the same time as mentioned by Fokko. From a maintenance perspective, it would be nice to drop both. On Mon, Apr 29, 2024 at 11:20 AM Jean-Baptiste Onofré wrote: > Hi > > I think it's time to drop JDK8 support. I would say that we should > keep Java11 (jumping directly to Java17 would be problematic > potentially for some users I guess). > > Regards > JB > > On Thu, Apr 25, 2024 at 10:21 PM James Duong > wrote: > > > > If we dropped JDK 8, we could use the JDK to compile module-info.java > files. Then we could remove the custom maven plugin we’re using for > compiling module-info.java files for JPMS support and get better IDE > integration (as what we’re doing currently somewhat shoe-horns module > information alongside JDK8 bytecode). > > > > From: Dane Pitkin > > Date: Thursday, April 25, 2024 at 1:02 PM > > To: dev@arrow.apache.org > > Subject: [DISCUSS] Drop Java 8 support > > Hi all, > > > > I would like to revisit the discussion of dropping Java 8 (and maybe 11) > > from Arrow's Java implementation. See GH issue[1] below. This was also > > discussed in the last Arrow community sync meeting on 2024-04-24. > > > > For context, this was discussed[2] last year on this mailing list. We > > decided to revisit the discussion around the June 2024 release (Arrow > v17). > > The timing coincides with the initial release of Apache Spark 4.0.0, > which > > drops both Java 8 and 11 support. > > > > For background, we chose not to drop Java 8 support last year because > Arrow > > is seen as a low level library that should support as many environments > as > > possible. Nowadays, we see more enthusiasm for dropping Java 8 (and 11) > as > > exemplified by Apache Spark as well as Apache Iceberg[3]. > > > > Is it time to consider dropping Java 8? 
Should we drop Java 11 and skip > > straight to Java 17 as our minimum version? What implications do we need > to > > be aware of? > > > > Thanks, > > Dane > > > > [1]https://github.com/apache/arrow/issues/38051 > > [2]https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8 > > [3]https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368 >
[Discuss] Extension types based on canonical extension types?
The vote on adding a JSON canonical extension type [1] got me wondering: Is it possible to define an extension type that is based on a canonical extension type? If so, how? For example, say I wanted to define a (non-canonical) HLLSKETCH extension type that corresponds to the type that Redshift uses for HyperLogLog sketches and is represented as JSON [2]. Is there a way to do this by building on the JSON canonical extension type? [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html Ian