[RESULT][VOTE][RUST][DataFusion] Release Apache Arrow DataFusion 35.0.0 RC1

2024-01-25 Thread Andy Grove
On Thu, Jan 25, 2024 at 8:33 AM Andy Grove  wrote:

> The vote passes with three binding +1 votes. Thanks, everyone.
>
> The release is available at
> https://dist.apache.org/repos/dist/release/arrow/arrow-datafusion-35.0.0/
>
> On Sun, Jan 21, 2024 at 12:38 PM L. C. Hsieh  wrote:
>
>> +1 (binding)
>>
>> Agreed with Andrew. This looks like a test only issue.
>> I think we should address the Expr PartialOrd further
>> (https://github.com/apache/arrow-datafusion/issues/8932), but it
>> should not block the release.
>>
>> Thanks Andy.
>>
>> On Sun, Jan 21, 2024 at 3:13 AM Andrew Lamb  wrote:
>> >
>> > +1 (binding)
>> >
>> > I verified it on Mac (M3).
>> >
>> > I got the same error in test_partial_ord and I agree it looks very much
>> > the same as https://github.com/apache/arrow-datafusion/pull/8908 -- a
>> > test-only issue that should not block the release.
>> >
>> > Thanks Andy
>> >
>> >
>> > On Sat, Jan 20, 2024 at 10:43 AM Andy Grove wrote:
>> >
>> > > Hi,
>> > >
>> > > I would like to propose a release of Apache Arrow DataFusion
>> > > Implementation,
>> > > version 35.0.0.
>> > >
>> > > This release candidate is based on commit:
>> > > e58446bbe9ebe3f5a2aae1abd3c17a694070b0d1 [1]
>> > > The proposed release tarball and signatures are hosted at [2].
>> > > The changelog is located at [3].
>> > >
>> > > Please download, verify checksums and signatures, run the unit tests,
>> > > and vote on the release. The vote will be open for at least 72 hours.
>> > >
>> > > Only votes from PMC members are binding, but all members of the
>> > > community are encouraged to test the release and vote with
>> > > "(non-binding)".
>> > >
>> > > The standard verification procedure is documented at
>> > > https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#verifying-release-candidates
>> > >
>> > > [ ] +1 Release this as Apache Arrow DataFusion 35.0.0
>> > > [ ] +0
>> > > [ ] -1 Do not release this as Apache Arrow DataFusion 35.0.0 because...
>> > >
>> > > Here is my vote:
>> > >
>> > > +1
>> > >
>> > > [1]: https://github.com/apache/arrow-datafusion/tree/e58446bbe9ebe3f5a2aae1abd3c17a694070b0d1
>> > > [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-35.0.0-rc1
>> > > [3]: https://github.com/apache/arrow-datafusion/blob/e58446bbe9ebe3f5a2aae1abd3c17a694070b0d1/CHANGELOG.md
>> > >
>>
>


Re: [Python][Rust] Is Arrow Rust supposed to support S3-compatible storage with non-public certificates?

2024-01-25 Thread Raphael Taylor-Davies
On a related note, version 0.9.0 switched to using the system CAs by default
[1], so if your private CA chain has been added to the system trust store it
should work.

[1]: https://github.com/apache/arrow-rs/pull/5056
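
For reference, a minimal sketch of reading a single object from an
S3-compatible endpoint with the object_store crate (0.9 or later, "aws"
feature enabled). The endpoint, bucket, key and credentials below are
placeholders, and TLS trust is assumed to come from the system CA store, so
the corporate root CA must already be installed there; this is not
necessarily the code path polars takes, just an illustration of the same
stack:

use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder endpoint/bucket/credentials for an on-prem S3-compatible store.
    let store = AmazonS3Builder::new()
        .with_endpoint("https://s3.internal.example.com")
        .with_bucket_name("my-bucket")
        .with_region("us-east-1")
        .with_access_key_id("ACCESS_KEY")
        .with_secret_access_key("SECRET_KEY")
        .build()?;

    // Fetch one object; this is where "invalid peer certificate: UnknownIssuer"
    // surfaces if the issuing CA is not trusted by the TLS backend.
    let result = store.get(&Path::from("data/example.parquet")).await?;
    let bytes = result.bytes().await?;
    println!("read {} bytes", bytes.len());
    Ok(())
}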

On 25 January 2024 09:17:55 GMT, Raphael Taylor-Davies wrote:
>The ticket for supporting self-signed certificates can be found here [1].
>
>If you switch the TLS backend to OpenSSL it may respect the SSL_CERT_FILE 
>environment variable, but I'm not very familiar with the particulars of that 
>library. This would require customising the Rust build, however, which may not 
>be possible if calling from python.
>
>Kind Regards,
>
>Raphael
>
>
>[1]: https://github.com/apache/arrow-rs/issues/5034
>
>>On 25 January 2024 08:44:45 GMT, elveshoern32 wrote:
>>Since my question remained unanswered on the user list, I dare to ask again 
>>on the dev list:
>>
>>
>>While experimenting with polars [1] (which is based on arrow-rs) I found that 
>>it's not possible to read a single file from our on-prem S3-compatible 
>>storage.
>>
>>Any attempts result in SSL error messages:
>>
>>
>>
>>error trying to connect: invalid peer certificate: UnknownIssuer
>>
>>
>>
>>Such SSL errors are well-known to us and usually get fixed by setting the 
>>environment variable SSL_CERT_FILE (or something similar) pointing to our 
>>company's certstore.
>>
>>polars seems to ignore that env var.
>>
>>Now it's unclear to me whether this is an issue of polars or arrow-rs (or 
>>anything else).
>>
>>
>>
>>For more details see [2].
>>
>>
>>
>>[1] https://pola.rs/
>>
>>[2] https://github.com/pola-rs/polars/issues/13741 

Re: [Python][Rust] Is Arrow Rust supposed to support S3-compatible storage with non-public certificates?

2024-01-25 Thread Raphael Taylor-Davies
The ticket for supporting self-signed certificates can be found here [1].

If you switch the TLS backend to OpenSSL it may respect the SSL_CERT_FILE 
environment variable, but I'm not very familiar with the particulars of that 
library. This would require customising the Rust build, however, which may not 
be possible if calling from python.

Kind Regards,

Raphael


[1]: https://github.com/apache/arrow-rs/issues/5034

On 25 January 2024 08:44:45 GMT, elveshoern32 wrote:
>Since my question remained unanswered on the user list, I dare to ask again on 
>the dev list:
>
>
>While experimenting with polars [1] (which is based on arrow-rs) I found that 
>it's not possible to read a single file from our on-prem S3-compatible storage.
>
>Any attempts result in SSL error messages:
>
>
>
>error trying to connect: invalid peer certificate: UnknownIssuer
>
>
>
>Such SSL errors are well-known to us and usually get fixed by setting the 
>environment variable SSL_CERT_FILE (or something similar) pointing to our 
>company's certstore.
>
>polars seems to ignore that env var.
>
>Now it's unclear to me whether this is an issue of polars or arrow-rs (or 
>anything else).
>
>
>
>For more details see [2].
>
>
>
>[1] https://pola.rs/
>
>[2] https://github.com/pola-rs/polars/issues/13741 

Re: [IPC] Delta Dictionary Flag Clarification for Multi-Batch IPC

2024-01-25 Thread Antoine Pitrou



Hello,

My own answers:

1) isDelta should be true only when a delta is being transmitted (to be 
appended to the existing dictionary with the same id); it should be 
false when a full dictionary is being transmitted (to replace the 
existing dictionary with the same id, if any)

2) yes, it could
3) yes
4) there's no reason it can't be valid

Regards

Antoine.
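
For illustration, a minimal sketch (plain Rust, not any particular Arrow
implementation's API; type and field names are hypothetical) of the
reader-side rule in answer 1: a message with isDelta=true appends to the
dictionary with the same id, while isDelta=false replaces it.

use std::collections::HashMap;

// Hypothetical decoded dictionary message; field names are illustrative only.
struct DictionaryMessage {
    id: i64,
    is_delta: bool,
    values: Vec<String>,
}

fn apply_dictionary(state: &mut HashMap<i64, Vec<String>>, msg: DictionaryMessage) {
    if msg.is_delta {
        // Delta: append to the existing dictionary with the same id.
        state.entry(msg.id).or_default().extend(msg.values);
    } else {
        // Full dictionary: replace whatever was previously stored for this id.
        state.insert(msg.id, msg.values);
    }
}

fn main() {
    let mut dicts = HashMap::new();
    // batch 1: full dictionary [a, b, c]
    apply_dictionary(&mut dicts, DictionaryMessage {
        id: 1, is_delta: false,
        values: vec!["a".into(), "b".into(), "c".into()],
    });
    // batch 2: delta [d] -> dictionary becomes [a, b, c, d]
    apply_dictionary(&mut dicts, DictionaryMessage {
        id: 1, is_delta: true,
        values: vec!["d".into()],
    });
    assert_eq!(dicts[&1], vec!["a", "b", "c", "d"]);
}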


On 25/01/2024 07:25, Micah Kornfield wrote:

Hi Chris,
My interpretations:
1) I'm not sure it is clearly defined, but my impression is the first
dictionary is never a delta dictionary (option 1)
2) I don't think they are prevented from switching state (which I suppose
makes things more complicated, but hopefully not by much)
3) Dictionaries are reused across batches unless replaced.
4) I'm not sure I understand this question. Dictionaries should be passed
independently of the indexes?

Thanks,
Micah

On Fri, Jan 19, 2024 at 1:55 PM Chris Larsen wrote:


Hi folks,

I'm working on multi-batch dictionary support with deltas in Java [1] and
would like some clarifications. Given the "isDelta" flag in the dictionary
message [2], when should this be set to "true"?

1) If we have a dictionary with an ID of 1 that we want to delta encode and
it is used across multiple batches, should the initial batch have
`isDelta=false` and subsequent batches have `isDelta=true` (a writer-side
sketch of this case appears after the references below)? E.g.

batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]

Or should the flag be true for the entire IPC flow? E.g.

batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]

Either works for me.

2) Could (in stream, not file IPCs) a single dictionary ever switch state
across batches from delta to replacement mode or vice-versa? E.g.

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2]

I'd like to keep the protocol and API simple and assume switching is not
allowed. This would mean the 2nd example above would be canonical.

3) Are replacement dictionaries required to be serialized for every batch,
or is a dictionary re-used across batches until a replacement is received?
The C++ IPC API has a 'unify_dictionaries' option [3] whose documentation
mentions that "a column with a dictionary type must have the same dictionary
in each record batch". I assume (and prefer) the latter, that replacements
are serialized once and re-used. E.g.

batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1] // use previous dictionary
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2] // replacement

And I assume that 'unify_dictionaries' simply concatenates all dictionaries
into a single vector serialized in the first batch (haven't looked at the
code yet).

4) Is it valid for a delta dictionary to have an update in a subsequent
batch even though the update is not used in that batch? A silly example
would be:

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null, null, null]
batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]

Thanks for your help!

[1] https://github.com/apache/arrow/pull/38423
[2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
[3]

https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE
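
As referenced in question 1, a minimal writer-side sketch of the first
example above (plain Rust, not the arrow-java API that [1] implements;
names are hypothetical), assuming each batch's dictionary only grows by
appending: the first dictionary message for an id is a full dictionary
(isDelta=false) and later messages carry only the newly seen values as
deltas (isDelta=true).

use std::collections::HashMap;

#[derive(Default)]
struct DictionaryTracker {
    // Values already transmitted for each dictionary id.
    sent: HashMap<i64, Vec<String>>,
}

impl DictionaryTracker {
    // Returns (is_delta, values_to_send) for the current batch's dictionary.
    fn encode(&mut self, id: i64, dict: &[String]) -> (bool, Vec<String>) {
        let sent = self.sent.entry(id).or_default();
        let is_delta = !sent.is_empty();
        // Assumes the dictionary only grows by appending new values at the end.
        let new_values: Vec<String> = dict[sent.len()..].to_vec();
        sent.extend(new_values.iter().cloned());
        (is_delta, new_values)
    }
}

fn main() {
    let to_vec = |s: &[&str]| s.iter().map(|v| v.to_string()).collect::<Vec<_>>();
    let mut tracker = DictionaryTracker::default();
    // batch 1: dictionary [a, b, c] -> full dictionary, isDelta=false
    assert_eq!(tracker.encode(1, &to_vec(&["a", "b", "c"])),
               (false, to_vec(&["a", "b", "c"])));
    // batch 2: dictionary grew to [a, b, c, d] -> delta [d], isDelta=true
    assert_eq!(tracker.encode(1, &to_vec(&["a", "b", "c", "d"])),
               (true, to_vec(&["d"])));
}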

--


Chris Larsen





[Python][Rust] Is Arrow Rust supposed to support S3-compatible storage with non-public certificates?

2024-01-25 Thread elveshoern32
Since my question remained unanswered on the user list, I dare to ask again on 
the dev list:


While experimenting with polars [1] (which is based on arrow-rs) I found that 
it's not possible to read a single file from our on-prem S3-compatible storage.

Any attempts result in SSL error messages:



error trying to connect: invalid peer certificate: UnknownIssuer



Such SSL errors are well-known to us and usually get fixed by setting the 
environment variable SSL_CERT_FILE (or something similar) pointing to our 
company's certstore.

polars seems to ignore that env var.

Now it's unclear to me whether this is an issue of polars or arrow-rs (or 
anything else).



For more details see [2].



[1] https://pola.rs/

[2] https://github.com/pola-rs/polars/issues/13741