[ANNOUNCE] DataFusion Comet regular meetup

2024-04-30 Thread Andy Grove
*Note: I had previously sent a version of this email to the new DataFusion
dev@ mailing list, but I don't think many people have migrated to that yet,
so I am sending it to dev@ arrow as well.*

Hello,

I would like to invite anyone interested to join a regular meetup to
discuss the DataFusion Comet project with some of the core contributors.

This is a great place for new contributors to ask questions and coordinate
on working on issues.

The details are in the attached Google document [1], and the first meeting
will be tomorrow, Wednesday, May 1st.

Thanks,

Andy.

[1]
https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing


Re: [DISCUSS] Drop Java 8 support

2024-04-30 Thread martin . traverse
Speaking for my own product, we would like to see Java 11 support kept. We rely
heavily on Arrow and have Java 11 as our minimum supported version. We’d like
to keep doing that if possible. Our clients are big enterprises with 
notoriously sluggish update cycles, so we want to offer maximum compatibility. 
Once security patches are no longer available on the regular public channels 
then there is a compliance issue, so we generally follow the EOL schedule of 
our dependencies.

Corretto, Adoptium and Zulu all have recent public builds of both 8 and 11 and 
look set to support them with public builds for many years to come. Several 
organisations I have worked with switched away from Oracle when they made their 
licensing blunder with Java 8 and although that is rectified now, the change 
seems to have stuck in quite a few places (at least in my anecdotal experience).

A major practical difference to me in Java 17 is the strong encapsulation of 
internals. Since that affects the majority of serious Java applications then 
perhaps most people have figured out by now to add the JVM params that let Java 
continue working. Still, it could be a consideration if Java 17 is the
baseline supported version.

Regards,
Martin.

- In case anyone is curious why we don’t support Java 8 per our own policy, 
it’s because of the “var” keyword - seriously, why did Java take so long with 
that, even C++ got there sooner!

> On 30 Apr 2024, at 16:20, Jacob Wujciak  wrote:
> 
> Hello everyone!
> Great to see this move forward!
> +1 on dropping both 8 and 11 unless there is very good reason to keep 11
> around.
> Otherwise people will just move to 11 and then have the pain of migration
> again when we drop that (which will happen soon regardless imo).
> 
> On Tue, 30 Apr 2024 at 16:18, Dane Pitkin wrote:
> 
>> Thanks, JB. Are we aware of any downstream dependencies that would benefit
>> from maintaining Java 11 support? Apache Spark jumped straight to Java 17.
>> It seems other projects are dropping both 8 and 11 at the same time as
>> mentioned by Fokko. From a maintenance perspective, it would be nice to
>> drop both.
>> 
>> On Mon, Apr 29, 2024 at 11:20 AM Jean-Baptiste Onofré 
>> wrote:
>> 
>>> Hi
>>> 
>>> I think it's time to drop JDK8 support. I would say that we should
>>> keep Java11 (jumping directly to Java17 would be problematic
>>> potentially for some users I guess).
>>> 
>>> Regards
>>> JB
>>> 
>>> On Thu, Apr 25, 2024 at 10:21 PM James Duong wrote:
>>>>
>>>> If we dropped JDK 8, we could use the JDK to compile module-info.java
>>>> files. Then we could remove the custom maven plugin we’re using for
>>>> compiling module-info.java files for JPMS support and get better IDE
>>>> integration (as what we’re doing currently somewhat shoe-horns module
>>>> information alongside JDK8 bytecode).
>>>>
>>>> From: Dane Pitkin
>>>> Date: Thursday, April 25, 2024 at 1:02 PM
>>>> To: dev@arrow.apache.org
>>>> Subject: [DISCUSS] Drop Java 8 support
>>>> Hi all,
>>>>
>>>> I would like to revisit the discussion of dropping Java 8 (and maybe 11)
>>>> from Arrow's Java implementation. See GH issue[1] below. This was also
>>>> discussed in the last Arrow community sync meeting on 2024-04-24.
>>>>
>>>> For context, this was discussed[2] last year on this mailing list. We
>>>> decided to revisit the discussion around the June 2024 release (Arrow v17).
>>>> The timing coincides with the initial release of Apache Spark 4.0.0, which
>>>> drops both Java 8 and 11 support.
>>>>
>>>> For background, we chose not to drop Java 8 support last year because Arrow
>>>> is seen as a low level library that should support as many environments as
>>>> possible. Nowadays, we see more enthusiasm for dropping Java 8 (and 11) as
>>>> exemplified by Apache Spark as well as Apache Iceberg[3].
>>>>
>>>> Is it time to consider dropping Java 8? Should we drop Java 11 and skip
>>>> straight to Java 17 as our minimum version? What implications do we need to
>>>> be aware of?
>>>>
>>>> Thanks,
>>>> Dane
>>>>
>>>> [1]https://github.com/apache/arrow/issues/38051
>>>> [2]https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8
>>>> [3]https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
>>> 
>> 



Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Antoine Pitrou

+1 (binding)


On 19/04/2024 at 22:22, Rok Mihevc wrote:

Hi all,

Following initial requests [1][2] and recent tangential ML discussion [3] I
would like to propose a vote to add language for UUID canonical extension
type to CanonicalExtensions.rst as in PR [4] and written below.
A draft C++ and Python implementation PR can be seen here [5].

[1] https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j
[2] https://github.com/apache/arrow/issues/15058
[3] https://lists.apache.org/thread/8d5ldl5cb7mms21rd15lhpfrv4j9no4n
[4] https://github.com/apache/arrow/pull/41299 <- proposed change
[5] https://github.com/apache/arrow/pull/37298


The vote will be open for at least 72 hours.

[ ] +1 Accept this proposal
[ ] +0
[ ] -1 Do not accept this proposal because...


UUID


* Extension name: `arrow.uuid`.

* The storage type of the extension is ``FixedSizeBinary`` with a length of
16 bytes.

.. note::
   A specific UUID version is not required or guaranteed. This extension
   represents UUIDs as FixedSizeBinary(16) and does not interpret the bytes
   in any way.



Rok



Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) for the current proposal, i.e. with the RFC 8259
requirement and the 3 current String types allowed.


Regards

Antoine.


On 30/04/2024 at 19:26, Rok Mihevc wrote:

Hi all, thanks for the votes and comments so far.
I've amended [1] the proposed language with the RFC-8259 requirement as it
seems to be almost unanimously requested. New language is below.
To Micah's comment regarding rejecting Binary arrays [2] - please discuss
in the PR.

Let's leave the vote open until after the May holiday.

Rok

[1]
https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e
[2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040


JSON


* Extension name: `arrow.json`.

* The storage type of this extension is ``StringArray`` or
   ``LargeStringArray`` or ``StringViewArray``.
   *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.*

* Extension type parameters:

   This type does not have any parameters.

* Description of the serialization:

   Metadata is either an empty string or a JSON string with an empty object.
   In the future, additional fields may be added, but they are not required
   to interpret the array.



Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Jacob Wujciak
+1 (non-binding) Thanks for moving these two forward Rok!

On Tue, 30 Apr 2024 at 19:26, Rok Mihevc wrote:

> Hi all, thanks for the votes and comments so far.
> I've amended [1] the proposed language with the RFC-8259 requirement as it
> seems to be almost unanimously requested. New language is below.
> To Micah's comment regarding rejecting Binary arrays [2] - please discuss
> in the PR.
>
> Let's leave the vote open until after the May holiday.
>
> Rok
>
> [1]
>
> https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e
> [2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040
>
>
> JSON
> 
>
> * Extension name: `arrow.json`.
>
> * The storage type of this extension is ``StringArray`` or
>   ``LargeStringArray`` or ``StringViewArray``.
>   *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.*
>
> * Extension type parameters:
>
>   This type does not have any parameters.
>
> * Description of the serialization:
>
>   Metadata is either an empty string or a JSON string with an empty object.
>   In the future, additional fields may be added, but they are not required
>   to interpret the array.
>


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Joris Van den Bossche
+1 (binding)

On Tue, 30 Apr 2024 at 19:52, Jacob Wujciak  wrote:

> +1 (non-binding)
>
> On Tue, 30 Apr 2024 at 17:48, Weston Pace <
> weston.p...@gmail.com> wrote:
>
> > +1 (binding)
> >
> > On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc  wrote:
> >
> > > Thanks for all the reviews and comments! I've included the big-endian
> > > requirement so the proposed language is now as below.
> > > I'll leave the vote open until after the May holiday.
> > >
> > > Rok
> > >
> > > UUID
> > > 
> > >
> > > * Extension name: `arrow.uuid`.
> > >
> > > * The storage type of the extension is ``FixedSizeBinary`` with a
> length
> > of
> > > 16 bytes.
> > >
> > > .. note::
> > >A specific UUID version is not required or guaranteed. This
> extension
> > > represents
> > >UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not
> > > interpret the bytes in any way.
> > >
> >
>


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Jacob Wujciak
+1 (non-binding)

On Tue, 30 Apr 2024 at 17:48, Weston Pace <
weston.p...@gmail.com> wrote:

> +1 (binding)
>
> On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc  wrote:
>
> > Thanks for all the reviews and comments! I've included the big-endian
> > requirement so the proposed language is now as below.
> > I'll leave the vote open until after the May holiday.
> >
> > Rok
> >
> > UUID
> > 
> >
> > * Extension name: `arrow.uuid`.
> >
> > * The storage type of the extension is ``FixedSizeBinary`` with a length
> of
> > 16 bytes.
> >
> > .. note::
> >A specific UUID version is not required or guaranteed. This extension
> > represents
> >UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not
> > interpret the bytes in any way.
> >
>


Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Rok Mihevc
Hi all, thanks for the votes and comments so far.
I've amended [1] the proposed language with the RFC-8259 requirement as it
seems to be almost unanimously requested. New language is below.
To Micah's comment regarding rejecting Binary arrays [2] - please discuss
in the PR.

Let's leave the vote open until after the May holiday.

Rok

[1]
https://github.com/apache/arrow/pull/41257/commits/594945010e3b7d393b411aad971743ffcdbdbc8e
[2] https://github.com/apache/arrow/pull/41257#discussion_r1583441040


JSON


* Extension name: `arrow.json`.

* The storage type of this extension is ``StringArray`` or
  ``LargeStringArray`` or ``StringViewArray``.
  *Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.*

* Extension type parameters:

  This type does not have any parameters.

* Description of the serialization:

  Metadata is either an empty string or a JSON string with an empty object.
  In the future, additional fields may be added, but they are not required
  to interpret the array.
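
As a rough illustration of the two rules in the language above (this is not part of the proposed text), the checks could be sketched with Python's standard `json` module. The function names are invented for this sketch, and the `parse_constant` hook is an assumption about how to make Python's parser reject `NaN` and `Infinity`, which it accepts by default even though RFC 8259 does not allow them.

```python
import json


def _reject_constant(name: str):
    # Python's parser accepts NaN/Infinity by default; RFC 8259 does not.
    raise ValueError(f"{name} is not valid JSON under RFC 8259")


def is_valid_extension_metadata(serialized: str) -> bool:
    """Accept an empty string or a JSON document that is an empty object."""
    if serialized == "":
        return True
    try:
        parsed = json.loads(serialized)
    except ValueError:
        return False
    return isinstance(parsed, dict) and not parsed


def is_json_value(raw: bytes) -> bool:
    """Check that raw bytes are UTF-8 text that parses as JSON."""
    try:
        json.loads(raw.decode("utf-8"), parse_constant=_reject_constant)
    except ValueError:  # covers UnicodeDecodeError and JSONDecodeError
        return False
    return True


print(is_valid_extension_metadata("{}"))
print(is_json_value(b'{"k": [1, 2]}'))
```

An implementation would presumably run the second check per element of the String, LargeString or StringView storage array, but whether and when to validate is left to implementations by the text above.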


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Dewey Dunnington
I don't think there is any current barrier to using implementation
features of one extension type to help with another. In Python, for
example, one might be able to do:

import pyarrow as pa

class GeoJSONExtensionType(pa.ExtensionType):

    def __init__(self):
        # pa.JSONExtensionType is a placeholder name for the proposed type
        self._json_ext = pa.JSONExtensionType()

    def some_action(self):
        return self._json_ext.some_action()

One could do something similar with the Array/Scalar classes. I am not
sure there is anything "automatic" that any current implementation
would be able to offer even if this information were machine
parseable. The only thing I can think of is that implementations like
Arrow C++ that aggressively drop extension information might be able
to drop the extension type by assigning a different one; however, I am
not sure that it would be useful enough to ever be implemented.

-dewey

On Tue, Apr 30, 2024 at 1:31 PM Ian Cook  wrote:
>
> But consider that a user might want to define a
> non-canonical HLLSKETCH extension type and make use of Arrow
> implementations' features for handling JSON canonical extension type
> columns in order to handle HLLSKETCH extension type columns. The spec
> currently does not provide any means to enable this. I wonder if we should
> consider incorporating something like this into the spec.
>
> For example, maybe the colon character could have the special meaning
> "represented as" in extension type names, so that implementations would
> recognize "hllsketch:arrow.json" as meaning: a column with extension type
> hllsketch, which is represented as in the JSON canonical extension type.
>
> Ian
>
> On Tue, Apr 30, 2024 at 11:51 AM Weston Pace  wrote:
>
> > I think "inheritance" and "composition" are more concerns for
> > implementations than they are for spec (I could be wrong here).
> >
> > So it seems that it would be sufficient to write the HLLSKETCH's canonical
> > definition as "this is an extension of the JSON logical type and supports
> > all the same storage types" and then allow implementations to use whatever
> > inheritance / composition scheme they want to behind the scenes.
> >
> > On Tue, Apr 30, 2024 at 7:47 AM Matt Topol  wrote:
> >
> > > I think the biggest blocker to doing this is the way that we pass
> > extension
> > > types through IPC. Extension types are sent as their underlying storage
> > > type with metadata key-value pairs of specific keys
> > "ARROW:extension:name"
> > > and "ARROW:extension:metadata". Since you can't have multiple values for
> > > the same key in the metadata, this would prevent the ability to define an
> > > extension type in terms of another extension type as you wouldn't be able
> > > to include the metadata for the second-level extension part.
> > >
> > > i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
> > > wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
> > > storage type. So the storage type needs to be a valid core Arrow data
> > type
> > > for this reason.
> > >
> > > On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:
> > >
> > > > The vote on adding a JSON canonical extension type [1] got me
> > wondering:
> > > Is
> > > > it possible to define an extension type that is based on a canonical
> > > > extension type? If so, how?
> > > >
> > > > For example, say I wanted to define a (non-canonical) HLLSKETCH
> > extension
> > > > type that corresponds to the type that Redshift uses for HyperLogLog
> > > > sketches and is represented as JSON [2]. Is there a way to do this by
> > > > building on the JSON canonical extension type?
> > > >
> > > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> > > > [2]
> > https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
> > > >
> > > > Ian
> > > >
> > >
> >


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Ian Cook
But consider that a user might want to define a
non-canonical HLLSKETCH extension type and make use of Arrow
implementations' features for handling JSON canonical extension type
columns in order to handle HLLSKETCH extension type columns. The spec
currently does not provide any means to enable this. I wonder if we should
consider incorporating something like this into the spec.

For example, maybe the colon character could have the special meaning
"represented as" in extension type names, so that implementations would
recognize "hllsketch:arrow.json" as meaning: a column with extension type
hllsketch, which is represented as the JSON canonical extension type.
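
To make the idea concrete, here is a minimal sketch of how an implementation might split such a name. Nothing like this exists in the spec today; `ParsedName` and `parse_extension_name` are hypothetical names, and treating only the first colon as the separator is an assumption.

```python
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    name: str                      # the extension type's own name
    represented_as: Optional[str]  # canonical type it builds on, if any


def parse_extension_name(raw: str) -> ParsedName:
    """Split 'name:base' at the first colon; names without one have no base."""
    name, _, base = raw.partition(":")
    return ParsedName(name, base or None)


print(parse_extension_name("hllsketch:arrow.json"))
print(parse_extension_name("arrow.uuid"))
```

A reader that does not know hllsketch could then fall back to whatever handling it has for the type named after the colon.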

Ian

On Tue, Apr 30, 2024 at 11:51 AM Weston Pace  wrote:

> I think "inheritance" and "composition" are more concerns for
> implementations than they are for spec (I could be wrong here).
>
> So it seems that it would be sufficient to write the HLLSKETCH's canonical
> definition as "this is an extension of the JSON logical type and supports
> all the same storage types" and then allow implementations to use whatever
> inheritance / composition scheme they want to behind the scenes.
>
> On Tue, Apr 30, 2024 at 7:47 AM Matt Topol  wrote:
>
> > I think the biggest blocker to doing this is the way that we pass
> extension
> > types through IPC. Extension types are sent as their underlying storage
> > type with metadata key-value pairs of specific keys
> "ARROW:extension:name"
> > and "ARROW:extension:metadata". Since you can't have multiple values for
> > the same key in the metadata, this would prevent the ability to define an
> > extension type in terms of another extension type as you wouldn't be able
> > to include the metadata for the second-level extension part.
> >
> > i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
> > wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
> > storage type. So the storage type needs to be a valid core Arrow data
> type
> > for this reason.
> >
> > On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:
> >
> > > The vote on adding a JSON canonical extension type [1] got me
> wondering:
> > Is
> > > it possible to define an extension type that is based on a canonical
> > > extension type? If so, how?
> > >
> > > For example, say I wanted to define a (non-canonical) HLLSKETCH
> extension
> > > type that corresponds to the type that Redshift uses for HyperLogLog
> > > sketches and is represented as JSON [2]. Is there a way to do this by
> > > building on the JSON canonical extension type?
> > >
> > > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> > > [2]
> https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
> > >
> > > Ian
> > >
> >
>


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Weston Pace
I think "inheritance" and "composition" are more concerns for
implementations than they are for spec (I could be wrong here).

So it seems that it would be sufficient to write the HLLSKETCH's canonical
definition as "this is an extension of the JSON logical type and supports
all the same storage types" and then allow implementations to use whatever
inheritance / composition scheme they want to behind the scenes.

On Tue, Apr 30, 2024 at 7:47 AM Matt Topol  wrote:

> I think the biggest blocker to doing this is the way that we pass extension
> types through IPC. Extension types are sent as their underlying storage
> type with metadata key-value pairs of specific keys "ARROW:extension:name"
> and "ARROW:extension:metadata". Since you can't have multiple values for
> the same key in the metadata, this would prevent the ability to define an
> extension type in terms of another extension type as you wouldn't be able
> to include the metadata for the second-level extension part.
>
> i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
> wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
> storage type. So the storage type needs to be a valid core Arrow data type
> for this reason.
>
> On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:
>
> > The vote on adding a JSON canonical extension type [1] got me wondering:
> Is
> > it possible to define an extension type that is based on a canonical
> > extension type? If so, how?
> >
> > For example, say I wanted to define a (non-canonical) HLLSKETCH extension
> > type that corresponds to the type that Redshift uses for HyperLogLog
> > sketches and is represented as JSON [2]. Is there a way to do this by
> > building on the JSON canonical extension type?
> >
> > [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> > [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
> >
> > Ian
> >
>


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding)

On Tue, Apr 30, 2024 at 7:53 AM Rok Mihevc  wrote:

> Thanks for all the reviews and comments! I've included the big-endian
> requirement so the proposed language is now as below.
> I'll leave the vote open until after the May holiday.
>
> Rok
>
> UUID
> 
>
> * Extension name: `arrow.uuid`.
>
> * The storage type of the extension is ``FixedSizeBinary`` with a length of
> 16 bytes.
>
> .. note::
>A specific UUID version is not required or guaranteed. This extension
> represents
>UUIDs as FixedSizeBinary(16) *with big-endian notation* and does not
> interpret the bytes in any way.
>


Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Weston Pace
+1 (binding)

I agree we should be explicit about RFC-8259

On Mon, Apr 29, 2024 at 4:46 PM David Li  wrote:

> +1 (binding)
>
> assuming we explicitly state RFC-8259
>
> On Tue, Apr 30, 2024, at 08:02, Matt Topol wrote:
> > +1 (binding)
> >
> > On Mon, Apr 29, 2024 at 5:36 PM Ian Cook  wrote:
> >
> >> +1 (non-binding)
> >>
> >> I added a comment in the PR suggesting that we explicitly refer to
> RFC-8259
> >> in CanonicalExtensions.rst.
> >>
> >> On Mon, Apr 29, 2024 at 1:21 PM Micah Kornfield 
> >> wrote:
> >>
> >> > +1, I added a comment to the PR because I think we should recommend
> >> > implementations specifically reject parsing Binary arrays with the
> >> > annotation in case we want to support non-UTF8 encodings in the future
> >> > (even though IIRC these aren't really JSON spec compliant).
> >> >
> >> > On Fri, Apr 19, 2024 at 1:24 PM Rok Mihevc 
> wrote:
> >> >
> >> > > Hi all,
> >> > >
> >> > > Following discussions [1][2] and preliminary implementation work (by
> >> > > Pradeep Gollakota) [3] I would like to propose a vote to add
> language
> >> for
> >> > > JSON canonical extension type to CanonicalExtensions.rst as in PR
> [4]
> >> and
> >> > > written below.
> >> > > A draft C++ implementation PR can be seen here [3].
> >> > >
> >> > > [1]
> https://lists.apache.org/thread/p3353oz6lk846pnoq6vk638tjqz2hm1j
> >> > > [2]
> https://lists.apache.org/thread/7xph3476g9rhl9mtqvn804fqf5z8yoo1
> >> > > [3] https://github.com/apache/arrow/pull/13901
> >> > > [4] https://github.com/apache/arrow/pull/41257 <- proposed change
> >> > >
> >> > >
> >> > > The vote will be open for at least 72 hours.
> >> > >
> >> > > [ ] +1 Accept this proposal
> >> > > [ ] +0
> >> > > [ ] -1 Do not accept this proposal because...
> >> > >
> >> > >
> >> > > JSON
> >> > > 
> >> > >
> >> > > * Extension name: `arrow.json`.
> >> > >
> >> > > * The storage type of this extension is ``StringArray`` or
> >> > >   ``LargeStringArray`` or ``StringViewArray``.
> >> > >   Only UTF-8 encoded JSON is supported.
> >> > >
> >> > > * Extension type parameters:
> >> > >
> >> > >   This type does not have any parameters.
> >> > >
> >> > > * Description of the serialization:
> >> > >
> >> > >   Metadata is either an empty string or a JSON string with an empty
> >> > object.
> >> > >   In the future, additional fields may be added, but they are not
> >> > required
> >> > >   to interpret the array.
> >> > >
> >> > >
> >> > >
> >> > > Rok
> >> > >
> >> >
> >>
>


Re: [DISCUSS] Drop Java 8 support

2024-04-30 Thread Jacob Wujciak
Hello everyone!
Great to see this move forward!
+1 on dropping both 8 and 11 unless there is very good reason to keep 11
around.
Otherwise people will just move to 11 and then have the pain of migration
again when we drop that (which will happen soon regardless imo).

On Tue, 30 Apr 2024 at 16:18, Dane Pitkin wrote:

> Thanks, JB. Are we aware of any downstream dependencies that would benefit
> from maintaining Java 11 support? Apache Spark jumped straight to Java 17.
> It seems other projects are dropping both 8 and 11 at the same time as
> mentioned by Fokko. From a maintenance perspective, it would be nice to
> drop both.
>
> On Mon, Apr 29, 2024 at 11:20 AM Jean-Baptiste Onofré 
> wrote:
>
> > Hi
> >
> > I think it's time to drop JDK8 support. I would say that we should
> > keep Java11 (jumping directly to Java17 would be problematic
> > potentially for some users I guess).
> >
> > Regards
> > JB
> >
> > On Thu, Apr 25, 2024 at 10:21 PM James Duong
> >  wrote:
> > >
> > > If we dropped JDK 8, we could use the JDK to compile module-info.java
> > files. Then we could remove the custom maven plugin we’re using for
> > compiling module-info.java files for JPMS support and get better IDE
> > integration (as what we’re doing currently somewhat shoe-horns module
> > information alongside JDK8 bytecode).
> > >
> > > From: Dane Pitkin 
> > > Date: Thursday, April 25, 2024 at 1:02 PM
> > > To: dev@arrow.apache.org 
> > > Subject: [DISCUSS] Drop Java 8 support
> > > Hi all,
> > >
> > > I would like to revisit the discussion of dropping Java 8 (and maybe
> 11)
> > > from Arrow's Java implementation. See GH issue[1] below. This was also
> > > discussed in the last Arrow community sync meeting on 2024-04-24.
> > >
> > > For context, this was discussed[2] last year on this mailing list. We
> > > decided to revisit the discussion around the June 2024 release (Arrow
> > v17).
> > > The timing coincides with the initial release of Apache Spark 4.0.0,
> > which
> > > drops both Java 8 and 11 support.
> > >
> > > For background, we chose not to drop Java 8 support last year because
> > Arrow
> > > is seen as a low level library that should support as many environments
> > as
> > > possible. Nowadays, we see more enthusiasm for dropping Java 8 (and 11)
> > as
> > > exemplified by Apache Spark as well as Apache Iceberg[3].
> > >
> > > Is it time to consider dropping Java 8? Should we drop Java 11 and skip
> > > straight to Java 17 as our minimum version? What implications do we
> need
> > to
> > > be aware of?
> > >
> > > Thanks,
> > > Dane
> > >
> > > [1]https://github.com/apache/arrow/issues/38051
> > > [2]https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8
> > > [3]https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
> >
>


[CROWDSOURCING] May 2024 ASF Board Report

2024-04-30 Thread Andrew Lamb
As part of being a new project, we need to submit reports to the board
every month for the first three months[1].

In the tradition of Apache Arrow, I hope the community can help draft this
report. Please take a look and add anything that might be relevant[2].

Thanks,
Andrew

[1]: https://github.com/apache/datafusion/issues/10281
[2]:
https://docs.google.com/document/d/1knyR2epIOY7WoXZO_DOtlcPNSenb3-V-osCHqPXqSms/edit


Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Rok Mihevc
Thanks for all the reviews and comments! I've included the big-endian
requirement so the proposed language is now as below.
I'll leave the vote open until after the May holiday.

Rok

UUID


* Extension name: `arrow.uuid`.

* The storage type of the extension is ``FixedSizeBinary`` with a length of
16 bytes.

.. note::
   A specific UUID version is not required or guaranteed. This extension
   represents UUIDs as FixedSizeBinary(16) *with big-endian notation* and
   does not interpret the bytes in any way.
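
As a small illustration of the storage layout described above (not part of the proposed language), Python's standard `uuid` module already exposes the 16-byte big-endian form as `UUID.bytes`. The helper names `to_storage` and `from_storage` are made up for this sketch.

```python
import uuid

UUID_BYTE_WIDTH = 16  # matches FixedSizeBinary(16)


def to_storage(value: uuid.UUID) -> bytes:
    """Return the 16 big-endian bytes that would be stored for one UUID."""
    return value.bytes  # UUID.bytes is big-endian; UUID.bytes_le is not


def from_storage(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 stored bytes without checking its version."""
    if len(raw) != UUID_BYTE_WIDTH:
        raise ValueError(f"expected {UUID_BYTE_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


example = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
stored = to_storage(example)
print(stored.hex())
print(from_storage(stored) == example)
```

With big-endian storage the bytes appear in the same order as the hex digits of the textual form, which is the main practical reason to pin the byte order down.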


Re: [Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Matt Topol
I think the biggest blocker to doing this is the way that we pass extension
types through IPC. Extension types are sent as their underlying storage
type with metadata key-value pairs of specific keys "ARROW:extension:name"
and "ARROW:extension:metadata". Since you can't have multiple values for
the same key in the metadata, this would prevent the ability to define an
extension type in terms of another extension type as you wouldn't be able
to include the metadata for the second-level extension part.

i.e. you'd be able to have "ARROW:extension:name" => "HLLSKETCH", but you
wouldn't be able to *also* have "ARROW:extension:name" => "JSON" for its
storage type. So the storage type needs to be a valid core Arrow data type
for this reason.
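
A toy model of the limitation, assuming (as described above) that field metadata behaves like a map with one value per key. This uses a plain Python dict rather than any Arrow API, and `annotate` is a hypothetical helper.

```python
NAME_KEY = "ARROW:extension:name"
METADATA_KEY = "ARROW:extension:metadata"


def annotate(field_metadata: dict, name: str, serialized: str = "") -> dict:
    """Return a copy of field_metadata carrying one extension annotation."""
    updated = dict(field_metadata)
    updated[NAME_KEY] = name
    updated[METADATA_KEY] = serialized
    return updated


# Annotate the storage as JSON first, then layer HLLSKETCH on top of it.
field_metadata = annotate({}, "JSON")
field_metadata = annotate(field_metadata, "HLLSKETCH")

# Only one name survives: the earlier "JSON" entry has been replaced.
print(field_metadata[NAME_KEY])
```

Under that assumption, a second-level extension has nowhere to put its own name and metadata, so the inner type's information is lost on the wire.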

On Tue, Apr 30, 2024 at 10:16 AM Ian Cook  wrote:

> The vote on adding a JSON canonical extension type [1] got me wondering: Is
> it possible to define an extension type that is based on a canonical
> extension type? If so, how?
>
> For example, say I wanted to define a (non-canonical) HLLSKETCH extension
> type that corresponds to the type that Redshift uses for HyperLogLog
> sketches and is represented as JSON [2]. Is there a way to do this by
> building on the JSON canonical extension type?
>
> [1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
> [2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html
>
> Ian
>


Re: [DISCUSS] Drop Java 8 support

2024-04-30 Thread Dane Pitkin
Thanks, JB. Are we aware of any downstream dependencies that would benefit
from maintaining Java 11 support? Apache Spark jumped straight to Java 17.
It seems other projects are dropping both 8 and 11 at the same time as
mentioned by Fokko. From a maintenance perspective, it would be nice to
drop both.

On Mon, Apr 29, 2024 at 11:20 AM Jean-Baptiste Onofré 
wrote:

> Hi
>
> I think it's time to drop JDK8 support. I would say that we should
> keep Java11 (jumping directly to Java17 would be problematic
> potentially for some users I guess).
>
> Regards
> JB
>
> On Thu, Apr 25, 2024 at 10:21 PM James Duong
>  wrote:
> >
> > If we dropped JDK 8, we could use the JDK to compile module-info.java
> files. Then we could remove the custom maven plugin we’re using for
> compiling module-info.java files for JPMS support and get better IDE
> integration (as what we’re doing currently somewhat shoe-horns module
> information alongside JDK8 bytecode).
> >
> > From: Dane Pitkin 
> > Date: Thursday, April 25, 2024 at 1:02 PM
> > To: dev@arrow.apache.org 
> > Subject: [DISCUSS] Drop Java 8 support
> > Hi all,
> >
> > I would like to revisit the discussion of dropping Java 8 (and maybe 11)
> > from Arrow's Java implementation. See GH issue[1] below. This was also
> > discussed in the last Arrow community sync meeting on 2024-04-24.
> >
> > For context, this was discussed[2] last year on this mailing list. We
> > decided to revisit the discussion around the June 2024 release (Arrow
> v17).
> > The timing coincides with the initial release of Apache Spark 4.0.0,
> which
> > drops both Java 8 and 11 support.
> >
> > For background, we chose not to drop Java 8 support last year because
> Arrow
> > is seen as a low level library that should support as many environments
> as
> > possible. Nowadays, we see more enthusiasm for dropping Java 8 (and 11)
> as
> > exemplified by Apache Spark as well as Apache Iceberg[3].
> >
> > Is it time to consider dropping Java 8? Should we drop Java 11 and skip
> > straight to Java 17 as our minimum version? What implications do we need
> to
> > be aware of?
> >
> > Thanks,
> > Dane
> >
> > [1]https://github.com/apache/arrow/issues/38051
> > [2]https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8
> > [3]https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368
>


[Discuss] Extension types based on canonical extension types?

2024-04-30 Thread Ian Cook
The vote on adding a JSON canonical extension type [1] got me wondering: Is
it possible to define an extension type that is based on a canonical
extension type? If so, how?

For example, say I wanted to define a (non-canonical) HLLSKETCH extension
type that corresponds to the type that Redshift uses for HyperLogLog
sketches and is represented as JSON [2]. Is there a way to do this by
building on the JSON canonical extension type?

[1] https://lists.apache.org/thread/4dw3dnz6rjp5wz2240mn299p51d5tvtq
[2] https://docs.aws.amazon.com/redshift/latest/dg/r_HLLSKTECH_type.html

Ian