[jira] [Created] (ARROW-8563) Minor change to make newBuilder public

2020-04-22 Thread Amol Umbarkar (Jira)
Amol Umbarkar created ARROW-8563:


 Summary: Minor change to make newBuilder public
 Key: ARROW-8563
 URL: https://issues.apache.org/jira/browse/ARROW-8563
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Amol Umbarkar


This minor change makes newBuilder() public to reduce verbosity for downstream packages.

To give you an example, I am working on reading and writing Parquet into Arrow record
batches, where Parquet data types are mapped to Arrow data types.
My repo: [https://github.com/mindhash/arrow-parquet-go]

In such cases, it would be nice to have a generic builder API (newBuilder) that
accepts a data type and returns the corresponding array builder.

I am looking at a similar situation for the JSON reader. I think this change would
make the builder API much easier to use for external as well as internal packages.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-22 Thread Fan Liya
My vote: +1

Best,
Liya Fan

On Thu, Apr 23, 2020 at 8:24 AM Wes McKinney  wrote:

> Hello,
>
> I have proposed adding a simple RecordBatch IPC message body
> compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
> protocol in GitHub PR [1] as discussed on the mailing list [2]. This
> is distinct from separate discussions about adding in-memory encodings
> (like RLE-encoding) to the Arrow columnar format.
>
> This change is not forward compatible so it will not be safe to send
> compressed messages to old libraries, but since we are still pre-1.0.0
> the consensus is that this is acceptable. We may separately consider
> increasing the metadata version for 1.0.0 to require clients to
> upgrade.
>
> Please vote whether to accept the addition. The vote will be open for
> at least 72 hours.
>
> [ ] +1 Accept this addition to the IPC protocol
> [ ] +0
> [ ] -1 Do not accept the changes because...
>
> Here is my vote: +1
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/6707
> [2]:
> https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E
>


[VOTE] Add "trivial" RecordBatch body compression to Arrow IPC protocol

2020-04-22 Thread Wes McKinney
Hello,

I have proposed adding a simple RecordBatch IPC message body
compression scheme (using either LZ4 or ZSTD) to the Arrow IPC
protocol in GitHub PR [1] as discussed on the mailing list [2]. This
is distinct from separate discussions about adding in-memory encodings
(like RLE-encoding) to the Arrow columnar format.

This change is not forward compatible so it will not be safe to send
compressed messages to old libraries, but since we are still pre-1.0.0
the consensus is that this is acceptable. We may separately consider
increasing the metadata version for 1.0.0 to require clients to
upgrade.

Please vote whether to accept the addition. The vote will be open for
at least 72 hours.

[ ] +1 Accept this addition to the IPC protocol
[ ] +0
[ ] -1 Do not accept the changes because...

Here is my vote: +1

Thanks,
Wes

[1]: https://github.com/apache/arrow/pull/6707
[2]: 
https://lists.apache.org/thread.html/r58c9d23ad159644fca590d8f841df80d180b11bfb72f949d601d764b%40%3Cdev.arrow.apache.org%3E


[jira] [Created] (ARROW-8562) [C++] IO: Parameterize I/O coalescing using S3 storage metrics

2020-04-22 Thread Mayur Srivastava (Jira)
Mayur Srivastava created ARROW-8562:
---

 Summary: [C++] IO: Parameterize I/O coalescing using S3 storage 
metrics
 Key: ARROW-8562
 URL: https://issues.apache.org/jira/browse/ARROW-8562
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Mayur Srivastava


Related to https://issues.apache.org/jira/browse/ARROW-7995

The adaptive I/O coalescing algorithm uses two parameters:
1. max_io_gap: maximum I/O gap/hole size in bytes
2. ideal_request_size: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below:

In an S3 compatible storage, there are two main metrics:
1. Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of a 
new S3 request
2. Transfer Bandwidth (BW) for data in bytes/sec

1. Computing max_io_gap:

max_io_gap = TTFB * BW

This is also called Bandwidth-Delay-Product (BDP).

Two byte ranges that have a gap between them can still be mapped to the same read if
the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if
the Time-To-First-Byte (the call setup latency of a new S3 request) is expected to
be greater than the cost of just reading and discarding the extra bytes on an
existing HTTP request.

2. Computing ideal_request_size:

We want high bandwidth utilization per S3 connection, i.e. we want to transfer
large amounts of data to amortize the seek overhead.
But we also want to leverage parallelism by slicing very large I/O requests into
chunks. We define two more config parameters with suggested default values to
control the slice size and to balance these two effects, with the goal of
maximizing net data load performance.

BW_util (ideal bandwidth utilization):
The fraction of per-connection bandwidth that should be utilized to maximize net
data load.
A good default value is 90% (0.9).

MAX_IDEAL_REQUEST_SIZE:
The maximum single request size (in bytes) to maximize net data load.
A good default value is 64 MiB.

The amount of data that needs to be transferred in a single S3 get_object 
request to achieve effective bandwidth eff_BW = BW_util * BW is as follows:
eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)

Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW, we get the 
following result:
ideal_request_size = max_io_gap * BW_util / (1 - BW_util)

Applying the MAX_IDEAL_REQUEST_SIZE, we get the following:
ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - 
BW_util))

The proposal is to create a named constructor in io::CacheOptions (PR:
[https://github.com/apache/arrow/pull/6744] created by [~lidavidm]) that computes
max_io_gap and ideal_request_size from TTFB and BW, which are then passed to the
reader to configure I/O coalescing.
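The two formulas above can be checked with a quick numeric sketch. This is a hypothetical helper for illustration only, not the io::CacheOptions API; the function and parameter names are assumptions:

```python
def coalescing_params(ttfb_s, bandwidth_bps, bw_util=0.9,
                      max_ideal_request_size=64 * 1024 * 1024):
    """Derive (max_io_gap, ideal_request_size), both in bytes, from the
    S3 metrics TTFB (seconds) and transfer bandwidth BW (bytes/sec)."""
    # Bandwidth-delay product: a gap smaller than this is cheaper to read
    # through and discard than to pay the setup cost of a new request.
    max_io_gap = ttfb_s * bandwidth_bps
    # From eff_BW = r / (TTFB + r / BW) with eff_BW = BW_util * BW:
    # r = max_io_gap * BW_util / (1 - BW_util), capped at the maximum.
    ideal_request_size = min(max_ideal_request_size,
                             max_io_gap * bw_util / (1 - bw_util))
    return int(max_io_gap), int(ideal_request_size)

# Example: 50 ms TTFB, 100 MiB/s per-connection bandwidth.
gap, req = coalescing_params(0.05, 100 * 1024 * 1024)
print(gap, req)  # gaps up to 5 MiB coalesced, ~45 MiB ideal requests
```

Note that with the suggested default BW_util = 0.9, the ideal request size works out to nine times the coalescing gap, until the 64 MiB cap takes over.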



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8561) [C++][Gandiva] Stop using deprecated google::protobuf::MessageLite::ByteSize()

2020-04-22 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8561:
---

 Summary: [C++][Gandiva] Stop using deprecated 
google::protobuf::MessageLite::ByteSize()
 Key: ARROW-8561
 URL: https://issues.apache.org/jira/browse/ARROW-8561
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


It's deprecated since Protobuf 3.4.0.

https://github.com/protocolbuffers/protobuf/blob/v3.4.0/CHANGES.txt#L58-L59

{quote}
  * ByteSize() and SpaceUsed() are deprecated. Use ByteSizeLong() and
SpaceUsedLong() instead
{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8560) [Rust] Docs for MutableBuffer resize are incorrect

2020-04-22 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8560:
--

 Summary: [Rust] Docs for MutableBuffer resize are incorrect
 Key: ARROW-8560
 URL: https://issues.apache.org/jira/browse/ARROW-8560
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[Discuss] [Rust] Common Trait(s) for iterating over RecordBatch's

2020-04-22 Thread paddy horan
Hi All,

I just opened ARROW-8559 [1] to consolidate the traits for record batch
iterators. I feel this needs to be done prior to 1.0, as we need to be clear about
what external crates should implement to integrate with the Arrow ecosystem.
This might be disruptive, though, so I wanted to bring it to the attention of
the mailing list.

Paddy

[1] - https://issues.apache.org/jira/browse/ARROW-8559


[jira] [Created] (ARROW-8559) [Rust] Consolidate Record Batch iterator traits in main arrow crate

2020-04-22 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8559:
--

 Summary: [Rust] Consolidate Record Batch iterator traits in main 
arrow crate
 Key: ARROW-8559
 URL: https://issues.apache.org/jira/browse/ARROW-8559
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan


We have the `BatchIterator` trait in DataFusion and the `RecordBatchReader` 
trait in the main arrow crate.

They differ in that `BatchIterator` is Send + Sync. They should both be in the
Arrow crate and be named `BatchIterator` and `SendableBatchIterator`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8558) [Rust] GitHub Actions missing rustfmt

2020-04-22 Thread Paddy Horan (Jira)
Paddy Horan created ARROW-8558:
--

 Summary: [Rust] GitHub Actions missing rustfmt
 Key: ARROW-8558
 URL: https://issues.apache.org/jira/browse/ARROW-8558
 Project: Apache Arrow
  Issue Type: New Feature
  Components: CI, Rust
Reporter: Paddy Horan
Assignee: Neville Dipale






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Arrow 0.17.0 - RC0

2020-04-22 Thread Neal Richardson
arrow.apache.org. They looked like 30 minutes before you updated it :/

On Wed, Apr 22, 2020 at 11:14 AM Krisztián Szűcs 
wrote:

> Which website?
>
> On Wed, Apr 22, 2020 at 7:31 PM Neal Richardson
>  wrote:
> >
> > On the post-release tasks, CRAN has accepted the 0.17 release. Homebrew
> > hasn't yet accepted because on initial review, they didn't believe that
> > we'd done the release because the website hadn't been updated yet.
> >
> > Neal
> >
> > On Wed, Apr 22, 2020 at 6:17 AM Wes McKinney 
> wrote:
> >
> > > FTR it seems that the compiler error on VS 2017 on Windows is showing
> > > up elsewhere
> > >
> > >
> > >
> https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=147598=logs=e5cdccbf-4751-5a24-7406-185c9d30d021=1a956d00-e7ee-5aca-9899-9cb1f0a4d4a8=1045
> > >
> > > If the -DCMAKE_UNITY_BUILD=ON workaround doesn't solve it we may have
> > > to make a 0.17.1 release
> > >
> > > On Tue, Apr 21, 2020 at 9:45 AM Wes McKinney 
> wrote:
> > > >
> > > > It looks like the rebase-PR step didn't work correctly per Micah's
> > > > comment (didn't work on my PR for ARROW-2714 either). Might want to
> > > > look into why not
> > > >
> > > > On Tue, Apr 21, 2020 at 6:23 AM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > On Tue, Apr 21, 2020 at 4:28 AM Andy Grove 
> > > wrote:
> > > > > >
> > > > > > Well, I got the crates published, but there's a nasty workaround
> > > for users
> > > > > > that want to use these crates as a dependency and it means there
> is
> > > no real
> > > > > > dependency management on the Flight protocol version. I think the
> > > answer is
> > > > > > that we need to publish the Flight.proto as part of the
> arrow-flight
> > > crate
> > > > > > and make sure that version is used in the custom build script.
> I'll
> > > look at
> > > > > > this again tomorrow and try and come up with a solution for the
> next
> > > > > > release.
> > > > > Thanks for handling it Andy!
> > > > >
> > > > > It occasionally happens that dependency problems come up with the
> > > > > crates during the release. Can we automate the testing of it?
> > > > > >
> > > > > > Here's the JIRA to track this specific issue.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/ARROW-8536
> > > > > I set it to critical for the next version.
> > > > > >
> > > > > > On Mon, Apr 20, 2020 at 7:49 PM Andy Grove <
> andygrov...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > I've run into issues publishing the Rust crates and I don't
> think
> > > I can
> > > > > > > resolve this tonight. I am documenting the issue in
> > > > > > > https://issues.apache.org/jira/browse/ARROW-8535
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 20, 2020 at 5:02 PM Krisztián Szűcs <
> > > szucs.kriszt...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Created a PR with updated docs.
> > > > > > >>
> > > > > > >> Conda post release task is left, it's a bit strange that the
> > > conda-forge
> > > > > > >> autotick bot has not created the version bump PRs yet. I'm
> > > updating
> > > > > > >> them manually tomorrow.
> > > > > > >>
> > > > > > >> 1.  [x] rebase
> > > > > > >> 2.  [x] upload source
> > > > > > >> 3.  [x] upload binaries
> > > > > > >> 4.  [x] update website
> > > > > > >> 5.  [x] upload ruby gems
> > > > > > >> 6.  [x] upload js packages
> > > > > > >> 8.  [x] upload C# packages
> > > > > > >> 9.  [Andy] upload rust crates
> > > > > > >> 10. [ ] update conda recipes
> > > > > > >> 11. [x] upload wheels to pypi
> > > > > > >> 12. [Neal] update homebrew packages
> > > > > > >> 13. [x] update maven artifacts
> > > > > > >> 14. [kou] update msys2
> > > > > > >> 15. [Neal] update R packages
> > > > > > >> 16. [x] update docs
> > > > > > >>
> > > > > > >> I'm going to announce 0.17 once the site PRs get merged.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Thanks,
> > > > > > >> > --
> > > > > > >> > kou
> > > > > > >> >
> > > > > > >> > In  > > u+1f3...@mail.gmail.com>
> > > > > > >> >   "Re: [VOTE] Release Apache Arrow 0.17.0 - RC0" on Mon, 20
> Apr
> > > 2020
> > > > > > >> 23:20:48 +0200,
> > > > > > >> >   Krisztián Szűcs  wrote:
> > > > > > >> >
> > > > > > >> > > On Mon, Apr 20, 2020 at 11:17 PM Andy Grove <
> > > andygrov...@gmail.com>
> > > > > > >> wrote:
> > > > > > >> > >>
> > > > > > >> > >> Ok, I can look into this after work today (in about 3
> hours).
> > > > > > >> > > Great, thanks!
> > > > > > >> > >
> > > > > > >> > > The current status is (`x` means done):
> > > > > > >> > >
> > > > > > >> > > 1.  [x] rebase
> > > > > > >> > > 2.  [x] upload source
> > > > > > >> > > 3.  [x] upload binaries
> > > > > > >> > > 4.  [x] update website
> > > > > > >> > > 5.  [x] upload ruby gems
> > > > > > >> > > 6.  [x] upload js packages
> > > > > > >> > > 8.  [ ] upload C# crates
> > > > > > >> > > 9.  [Andy] upload rust crates
> > > > > > >> > > 10. [ ] update conda recipes
> > > > > > >> > > 11. [x] upload wheels to pypi
> > > > > > >> > > 12. [Neal] 

Re: [VOTE] Release Apache Arrow 0.17.0 - RC0

2020-04-22 Thread Krisztián Szűcs
Which website?

On Wed, Apr 22, 2020 at 7:31 PM Neal Richardson
 wrote:
>
> On the post-release tasks, CRAN has accepted the 0.17 release. Homebrew
> hasn't yet accepted because on initial review, they didn't believe that
> we'd done the release because the website hadn't been updated yet.
>
> Neal
>
> On Wed, Apr 22, 2020 at 6:17 AM Wes McKinney  wrote:
>
> > FTR it seems that the compiler error on VS 2017 on Windows is showing
> > up elsewhere
> >
> >
> > https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=147598=logs=e5cdccbf-4751-5a24-7406-185c9d30d021=1a956d00-e7ee-5aca-9899-9cb1f0a4d4a8=1045
> >
> > If the -DCMAKE_UNITY_BUILD=ON workaround doesn't solve it we may have
> > to make a 0.17.1 release
> >
> > On Tue, Apr 21, 2020 at 9:45 AM Wes McKinney  wrote:
> > >
> > > It looks like the rebase-PR step didn't work correctly per Micah's
> > > comment (didn't work on my PR for ARROW-2714 either). Might want to
> > > look into why not
> > >
> > > On Tue, Apr 21, 2020 at 6:23 AM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > On Tue, Apr 21, 2020 at 4:28 AM Andy Grove 
> > wrote:
> > > > >
> > > > > Well, I got the crates published, but there's a nasty workaround
> > for users
> > > > > that want to use these crates as a dependency and it means there is
> > no real
> > > > > dependency management on the Flight protocol version. I think the
> > answer is
> > > > > that we need to publish the Flight.proto as part of the arrow-flight
> > crate
> > > > > and make sure that version is used in the custom build script. I'll
> > look at
> > > > > this again tomorrow and try and come up with a solution for the next
> > > > > release.
> > > > Thanks for handling it Andy!
> > > >
> > > > It occasionally happens that dependency problems come up with the
> > > > crates during the release. Can we automate the testing of it?
> > > > >
> > > > > Here's the JIRA to track this specific issue.
> > > > >
> > > > > https://issues.apache.org/jira/browse/ARROW-8536
> > > > I set it to critical for the next version.
> > > > >
> > > > > On Mon, Apr 20, 2020 at 7:49 PM Andy Grove 
> > wrote:
> > > > >
> > > > > > I've run into issues publishing the Rust crates and I don't think
> > I can
> > > > > > resolve this tonight. I am documenting the issue in
> > > > > > https://issues.apache.org/jira/browse/ARROW-8535
> > > > > >
> > > > > >
> > > > > > On Mon, Apr 20, 2020 at 5:02 PM Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> Created a PR with updated docs.
> > > > > >>
> > > > > >> Conda post release task is left, it's a bit strange that the
> > conda-forge
> > > > > >> autotick bot has not created the version bump PRs yet. I'm
> > updating
> > > > > >> them manually tomorrow.
> > > > > >>
> > > > > >> 1.  [x] rebase
> > > > > >> 2.  [x] upload source
> > > > > >> 3.  [x] upload binaries
> > > > > >> 4.  [x] update website
> > > > > >> 5.  [x] upload ruby gems
> > > > > >> 6.  [x] upload js packages
> > > > > >> 8.  [x] upload C# packages
> > > > > >> 9.  [Andy] upload rust crates
> > > > > >> 10. [ ] update conda recipes
> > > > > >> 11. [x] upload wheels to pypi
> > > > > >> 12. [Neal] update homebrew packages
> > > > > >> 13. [x] update maven artifacts
> > > > > >> 14. [kou] update msys2
> > > > > >> 15. [Neal] update R packages
> > > > > >> 16. [x] update docs
> > > > > >>
> > > > > >> I'm going to announce 0.17 once the site PRs get merged.
> > > > > >> >
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > --
> > > > > >> > kou
> > > > > >> >
> > > > > >> > In  > u+1f3...@mail.gmail.com>
> > > > > >> >   "Re: [VOTE] Release Apache Arrow 0.17.0 - RC0" on Mon, 20 Apr
> > 2020
> > > > > >> 23:20:48 +0200,
> > > > > >> >   Krisztián Szűcs  wrote:
> > > > > >> >
> > > > > >> > > On Mon, Apr 20, 2020 at 11:17 PM Andy Grove <
> > andygrov...@gmail.com>
> > > > > >> wrote:
> > > > > >> > >>
> > > > > >> > >> Ok, I can look into this after work today (in about 3 hours).
> > > > > >> > > Great, thanks!
> > > > > >> > >
> > > > > >> > > The current status is (`x` means done):
> > > > > >> > >
> > > > > >> > > 1.  [x] rebase
> > > > > >> > > 2.  [x] upload source
> > > > > >> > > 3.  [x] upload binaries
> > > > > >> > > 4.  [x] update website
> > > > > >> > > 5.  [x] upload ruby gems
> > > > > >> > > 6.  [x] upload js packages
> > > > > >> > > 8.  [ ] upload C# crates
> > > > > >> > > 9.  [Andy] upload rust crates
> > > > > >> > > 10. [ ] update conda recipes
> > > > > >> > > 11. [x] upload wheels to pypi
> > > > > >> > > 12. [Neal] update homebrew packages
> > > > > >> > > 13. [x] update maven artifacts
> > > > > >> > > 14. [ ] update msys2
> > > > > >> > > 15. [Neal] update R packages
> > > > > >> > > 16. [Krisztian] update docs
> > > > > >> > >>
> > > > > >> > >> On Mon, Apr 20, 2020, 2:47 PM Krisztián Szűcs <
> > > > > >> szucs.kriszt...@gmail.com>
> > > > > >> > >> wrote:
> > > > > >> > >>
> > > > > >> > >> > Thanks Andy! I tried to upload the rust 

[jira] [Created] (ARROW-8557) from pyarrow import parquet fails with AttributeError: type object 'pyarrow._parquet.Statistics' has no attribute '__reduce_cython__'

2020-04-22 Thread Haluk Tokgozoglu (Jira)
Haluk Tokgozoglu created ARROW-8557:
---

 Summary: from pyarrow import parquet fails with AttributeError: 
type object 'pyarrow._parquet.Statistics' has no attribute '__reduce_cython__'
 Key: ARROW-8557
 URL: https://issues.apache.org/jira/browse/ARROW-8557
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.0, 0.16.0, 0.15.1
 Environment: Python 3.8.4, GCC 4.8.4, Debian 8
Reporter: Haluk Tokgozoglu


I have tried versions 0.15.1, 0.16.0, and 0.17.0; the same error occurs on all of
them. I've seen in other issues that co-installations of tensorflow and numpy
might be causing problems. I have tensorflow==1.14.0 and numpy==1.16.4 installed
(along with many other libraries, but I've read that those two tend to cause
issues).

 


 
{code:java}
from pyarrow import parquet
 
~/python/lib/python3.6/site-packages/pyarrow/parquet.py in 
 32 import pyarrow as pa
 33 import pyarrow.lib as lib
---> 34 import pyarrow._parquet as _parquet
 35 
 36 from pyarrow._parquet import (ParquetReader, Statistics, # noqa
~/python/lib/python3.6/site-packages/pyarrow/_parquet.pyx in init 
pyarrow._parquet()
 
AttributeError: type object 'pyarrow._parquet.Statistics' has no attribute 
'__reduce_cython__'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Arrow 0.17.0 - RC0

2020-04-22 Thread Neal Richardson
On the post-release tasks, CRAN has accepted the 0.17 release. Homebrew
hasn't yet accepted because on initial review, they didn't believe that
we'd done the release because the website hadn't been updated yet.

Neal

On Wed, Apr 22, 2020 at 6:17 AM Wes McKinney  wrote:

> FTR it seems that the compiler error on VS 2017 on Windows is showing
> up elsewhere
>
>
> https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=147598=logs=e5cdccbf-4751-5a24-7406-185c9d30d021=1a956d00-e7ee-5aca-9899-9cb1f0a4d4a8=1045
>
> If the -DCMAKE_UNITY_BUILD=ON workaround doesn't solve it we may have
> to make a 0.17.1 release
>
> On Tue, Apr 21, 2020 at 9:45 AM Wes McKinney  wrote:
> >
> > It looks like the rebase-PR step didn't work correctly per Micah's
> > comment (didn't work on my PR for ARROW-2714 either). Might want to
> > look into why not
> >
> > On Tue, Apr 21, 2020 at 6:23 AM Krisztián Szűcs
> >  wrote:
> > >
> > > On Tue, Apr 21, 2020 at 4:28 AM Andy Grove 
> wrote:
> > > >
> > > > Well, I got the crates published, but there's a nasty workaround
> for users
> > > > that want to use these crates as a dependency and it means there is
> no real
> > > > dependency management on the Flight protocol version. I think the
> answer is
> > > > that we need to publish the Flight.proto as part of the arrow-flight
> crate
> > > > and make sure that version is used in the custom build script. I'll
> look at
> > > > this again tomorrow and try and come up with a solution for the next
> > > > release.
> > > Thanks for handling it Andy!
> > >
> > > It occasionally happens that dependency problems come up with the
> > > crates during the release. Can we automate the testing of it?
> > > >
> > > > Here's the JIRA to track this specific issue.
> > > >
> > > > https://issues.apache.org/jira/browse/ARROW-8536
> > > I set it to critical for the next version.
> > > >
> > > > On Mon, Apr 20, 2020 at 7:49 PM Andy Grove 
> wrote:
> > > >
> > > > > I've run into issues publishing the Rust crates and I don't think
> I can
> > > > > resolve this tonight. I am documenting the issue in
> > > > > https://issues.apache.org/jira/browse/ARROW-8535
> > > > >
> > > > >
> > > > > On Mon, Apr 20, 2020 at 5:02 PM Krisztián Szűcs <
> szucs.kriszt...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Created a PR with updated docs.
> > > > >>
> > > > >> Conda post release task is left, it's a bit strange that the
> conda-forge
> > > > >> autotick bot has not created the version bump PRs yet. I'm
> updating
> > > > >> them manually tomorrow.
> > > > >>
> > > > >> 1.  [x] rebase
> > > > >> 2.  [x] upload source
> > > > >> 3.  [x] upload binaries
> > > > >> 4.  [x] update website
> > > > >> 5.  [x] upload ruby gems
> > > > >> 6.  [x] upload js packages
> > > > >> 8.  [x] upload C# packages
> > > > >> 9.  [Andy] upload rust crates
> > > > >> 10. [ ] update conda recipes
> > > > >> 11. [x] upload wheels to pypi
> > > > >> 12. [Neal] update homebrew packages
> > > > >> 13. [x] update maven artifacts
> > > > >> 14. [kou] update msys2
> > > > >> 15. [Neal] update R packages
> > > > >> 16. [x] update docs
> > > > >>
> > > > >> I'm going to announce 0.17 once the site PRs get merged.
> > > > >> >
> > > > >> >
> > > > >> > Thanks,
> > > > >> > --
> > > > >> > kou
> > > > >> >
> > > > >> > In  u+1f3...@mail.gmail.com>
> > > > >> >   "Re: [VOTE] Release Apache Arrow 0.17.0 - RC0" on Mon, 20 Apr
> 2020
> > > > >> 23:20:48 +0200,
> > > > >> >   Krisztián Szűcs  wrote:
> > > > >> >
> > > > >> > > On Mon, Apr 20, 2020 at 11:17 PM Andy Grove <
> andygrov...@gmail.com>
> > > > >> wrote:
> > > > >> > >>
> > > > >> > >> Ok, I can look into this after work today (in about 3 hours).
> > > > >> > > Great, thanks!
> > > > >> > >
> > > > >> > > The current status is (`x` means done):
> > > > >> > >
> > > > >> > > 1.  [x] rebase
> > > > >> > > 2.  [x] upload source
> > > > >> > > 3.  [x] upload binaries
> > > > >> > > 4.  [x] update website
> > > > >> > > 5.  [x] upload ruby gems
> > > > >> > > 6.  [x] upload js packages
> > > > >> > > 8.  [ ] upload C# crates
> > > > >> > > 9.  [Andy] upload rust crates
> > > > >> > > 10. [ ] update conda recipes
> > > > >> > > 11. [x] upload wheels to pypi
> > > > >> > > 12. [Neal] update homebrew packages
> > > > >> > > 13. [x] update maven artifacts
> > > > >> > > 14. [ ] update msys2
> > > > >> > > 15. [Neal] update R packages
> > > > >> > > 16. [Krisztian] update docs
> > > > >> > >>
> > > > >> > >> On Mon, Apr 20, 2020, 2:47 PM Krisztián Szűcs <
> > > > >> szucs.kriszt...@gmail.com>
> > > > >> > >> wrote:
> > > > >> > >>
> > > > >> > >> > Thanks Andy! I tried to upload the rust packages but
> arrow-flight,
> > > > >> > >> > but a version pin is missing from the package tree:
> > > > >> > >> >
> > > > >> > >> > error: all dependencies must have a version specified when
> > > > >> publishing.
> > > > >> > >> > dependency `arrow-flight` does not specify a version
> > > > >> > >> >
> > > > >> > >> > Please upload 

[jira] [Created] (ARROW-8556) [R] Installation fails with `LIBARROW_MINIMAL=false`

2020-04-22 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-8556:
-

 Summary: [R] Installation fails with `LIBARROW_MINIMAL=false`
 Key: ARROW-8556
 URL: https://issues.apache.org/jira/browse/ARROW-8556
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.17.0
 Environment: Ubuntu 19.10
R 3.6.1
Reporter: Karl Dunkle Werner


I would like to install the `arrow` R package on my Ubuntu 19.10 system. 
Prebuilt binaries are unavailable, and I want to enable compression, so I set 
the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks 
like the package is able to compile, but can't be loaded. I'm able to install 
correctly if I don't set the {{LIBARROW_MINIMAL}} variable.

Here's the error I get:
{code:java}
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so':
  ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: 
ZSTD_initCStream
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘~/.R/3.6/arrow’
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8555) [FlightRPC][Java] Implement Flight DoExchange for Java

2020-04-22 Thread David Li (Jira)
David Li created ARROW-8555:
---

 Summary: [FlightRPC][Java] Implement Flight DoExchange for Java
 Key: ARROW-8555
 URL: https://issues.apache.org/jira/browse/ARROW-8555
 Project: Apache Arrow
  Issue Type: New Feature
  Components: FlightRPC
Reporter: David Li
Assignee: David Li


As described in the mailing list vote.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: 0.17 release blog post: help needed

2020-04-22 Thread Wes McKinney
No, nothing major that needs to be copied over

On Tue, Apr 21, 2020 at 5:15 PM Neal Richardson
 wrote:
>
> Hope you weren't editing the google doc while I was moving it to
> https://github.com/apache/arrow-site/pull/55. If so, would you mind copying
> over any relevant changes?
>
> Neal
>
>
> On Tue, Apr 21, 2020 at 1:53 PM Wes McKinney  wrote:
>
> > I did a few tweaks and cleanups. There are still a number of TODO
> > items in this document. It would be good to finish (or remove) these
> > so this can be published tomorrow or Thursday
> >
> > On Mon, Apr 20, 2020 at 7:47 AM Fan Liya  wrote:
> > >
> > > I have added some Java items.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Mon, Apr 20, 2020 at 10:49 AM Kenta Murata  wrote:
> > >
> > > > I've edited Ruby and C GLib parts.
> > > > Kou and Shiro will check them later.
> > > >
> > > > 2020年4月20日(月) 11:09 Wes McKinney :
> > > > >
> > > > > I made a pass through the changelog and added a bunch of TODOs
> > related
> > > > > to C++. In general, as a reminder, in these blog posts since the
> > > > > releases are growing large we should try to present as compact a high
> > > > > level summary as possible to convey some of the highlights of our
> > > > > labors (so likely not needed to write out any JIRA numbers, people
> > can
> > > > > look at the changelog for that). I'll spend some more time on the
> > blog
> > > > > post after others have had a chance to take a pass through
> > > > >
> > > > > On Sat, Apr 18, 2020 at 12:13 PM Neal Richardson
> > > > >  wrote:
> > > > > >
> > > > > > Hi all,
> > > > > > Since it looks like we're close to releasing 0.17, we need to fill
> > in
> > > > the
> > > > > > details for our blog post announcement. I've started a document
> > here:
> > > > > >
> > > >
> > https://docs.google.com/document/d/16UKZtvL49o8nCDN8JU3Ut6y76Y9d8-4qXv5vFv7aNvs/edit#heading=h.kqqacbm2lpv8
> > > > > >
> > > > > > Please fill in the details for the parts of the project you're
> > close
> > > > to.
> > > > > > I'll handle wrapping this up in the usual boilerplate when we're
> > done.
> > > > > >
> > > > > > Thanks,
> > > > > > Neal
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Kenta Murata
> > > >
> >


[jira] [Created] (ARROW-8554) [C++][Benchmark] Fix building error "cannot bind lvalue"

2020-04-22 Thread Jiajia Li (Jira)
Jiajia Li created ARROW-8554:


 Summary: [C++][Benchmark] Fix building error "cannot bind lvalue"
 Key: ARROW-8554
 URL: https://issues.apache.org/jira/browse/ARROW-8554
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking
Reporter: Jiajia Li


When running the commands:

```
cmake -DARROW_BUILD_BENCHMARKS=ON ..
make
```

the build fails with the following error:

```
bit_util_benchmark.cc:96:10: error: cannot bind
‘std::unique_ptr’ lvalue to ‘std::unique_ptr&&’
 return buffer;
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [VOTE] Release Apache Arrow 0.17.0 - RC0

2020-04-22 Thread Wes McKinney
FTR it seems that the compiler error on VS 2017 on Windows is showing
up elsewhere

https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=147598=logs=e5cdccbf-4751-5a24-7406-185c9d30d021=1a956d00-e7ee-5aca-9899-9cb1f0a4d4a8=1045

If the -DCMAKE_UNITY_BUILD=ON workaround doesn't solve it we may have
to make a 0.17.1 release

On Tue, Apr 21, 2020 at 9:45 AM Wes McKinney  wrote:
>
> It looks like the rebase-PR step didn't work correctly per Micah's
> comment (didn't work on my PR for ARROW-2714 either). Might want to
> look into why not
>
> On Tue, Apr 21, 2020 at 6:23 AM Krisztián Szűcs
>  wrote:
> >
> > On Tue, Apr 21, 2020 at 4:28 AM Andy Grove  wrote:
> > >
> > > Well, I got the crates published, but there's a nasty workaround for 
> > > users
> > > that want to use these crates as a dependency and it means there is no 
> > > real
> > > dependency management on the Flight protocol version. I think the answer 
> > > is
> > > that we need to publish the Flight.proto as part of the arrow-flight crate
> > > and make sure that version is used in the custom build script. I'll look 
> > > at
> > > this again tomorrow and try and come up with a solution for the next
> > > release.
> > Thanks for handling it Andy!
> >
> > It occasionally happens that dependency problems come up with the
> > crates during a release. Can we automate testing for this?
> > >
> > > Here's the JIRA to track this specific issue.
> > >
> > > https://issues.apache.org/jira/browse/ARROW-8536
> > I set it to critical for the next version.
> > >
> > > On Mon, Apr 20, 2020 at 7:49 PM Andy Grove  wrote:
> > >
> > > > I've run into issues publishing the Rust crates and I don't think I can
> > > > resolve this tonight. I am documenting the issue in
> > > > https://issues.apache.org/jira/browse/ARROW-8535
> > > >
> > > >
> > > > On Mon, Apr 20, 2020 at 5:02 PM Krisztián Szűcs 
> > > > 
> > > > wrote:
> > > >
> > > >> Created a PR with updated docs.
> > > >>
> > > >> Conda post release task is left, it's a bit strange that the 
> > > >> conda-forge
> > > >> autotick bot has not created the version bump PRs yet. I'm updating
> > > >> them manually tomorrow.
> > > >>
> > > >> 1.  [x] rebase
> > > >> 2.  [x] upload source
> > > >> 3.  [x] upload binaries
> > > >> 4.  [x] update website
> > > >> 5.  [x] upload ruby gems
> > > >> 6.  [x] upload js packages
> > > >> 8.  [x] upload C# packages
> > > >> 9.  [Andy] upload rust crates
> > > >> 10. [ ] update conda recipes
> > > >> 11. [x] upload wheels to pypi
> > > >> 12. [Neal] update homebrew packages
> > > >> 13. [x] update maven artifacts
> > > >> 14. [kou] update msys2
> > > >> 15. [Neal] update R packages
> > > >> 16. [x] update docs
> > > >>
> > > >> I'm going to announce 0.17 once the site PRs get merged.
> > > >> >
> > > >> >
> > > >> > Thanks,
> > > >> > --
> > > >> > kou
> > > >> >
> > > >> > In 
> > > >> > 
> > > >> >   "Re: [VOTE] Release Apache Arrow 0.17.0 - RC0" on Mon, 20 Apr 2020
> > > >> 23:20:48 +0200,
> > > >> >   Krisztián Szűcs  wrote:
> > > >> >
> > > >> > > On Mon, Apr 20, 2020 at 11:17 PM Andy Grove 
> > > >> wrote:
> > > >> > >>
> > > >> > >> Ok, I can look into this after work today (in about 3 hours).
> > > >> > > Great, thanks!
> > > >> > >
> > > >> > > The current status is (`x` means done):
> > > >> > >
> > > >> > > 1.  [x] rebase
> > > >> > > 2.  [x] upload source
> > > >> > > 3.  [x] upload binaries
> > > >> > > 4.  [x] update website
> > > >> > > 5.  [x] upload ruby gems
> > > >> > > 6.  [x] upload js packages
> > > >> > > 8.  [ ] upload C# crates
> > > >> > > 9.  [Andy] upload rust crates
> > > >> > > 10. [ ] update conda recipes
> > > >> > > 11. [x] upload wheels to pypi
> > > >> > > 12. [Neal] update homebrew packages
> > > >> > > 13. [x] update maven artifacts
> > > >> > > 14. [ ] update msys2
> > > >> > > 15. [Neal] update R packages
> > > >> > > 16. [Krisztian] update docs
> > > >> > >>
> > > >> > >> On Mon, Apr 20, 2020, 2:47 PM Krisztián Szűcs <
> > > >> szucs.kriszt...@gmail.com>
> > > >> > >> wrote:
> > > >> > >>
> > > >> > >> > Thanks Andy! I tried to upload the rust packages, but for
> > > >> > >> > arrow-flight a version pin is missing from the package tree:
> > > >> > >> >
> > > >> > >> > error: all dependencies must have a version specified when
> > > >> publishing.
> > > >> > >> > dependency `arrow-flight` does not specify a version
> > > >> > >> >
> > > >> > >> > Please upload the packages!
> > > >> > >> >
> > > >> > >> > Also added Uwe and Kou to the package owners.
> > > >> > >> >
> > > >> > >> > On Mon, Apr 20, 2020 at 10:24 PM Andy Grove 
> > > >> > >> > 
> > > >> wrote:
> > > >> > >> > >
> > > >> > >> > > You should have an invite for the arrow-flight crate. Please
> > > >> check
> > > >> > >> > > https://crates.io/me/pending-invites
> > > >> > >> > >
> > > >> > >> > > On Mon, Apr 20, 2020 at 2:10 PM Krisztián Szűcs <
> > > >> > >> > szucs.kriszt...@gmail.com>
> > > >> > >> > > wrote:
> > > >> > >> 

Re: [C++] Revamping approach to Arrow compute kernel development

2020-04-22 Thread Wes McKinney
On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield  wrote:
>
> Hi Wes,
> I haven't had time to read the doc, but wanted to ask some questions on
> points raised on the thread.
>
> * For efficiency, kernels used for array-expr evaluation should write
> > into preallocated memory as their default mode. This enables the
> > interpreter to avoid temporary memory allocations and improve CPU
> > cache utilization. Almost none of our kernels are implemented this way
> > currently.
>
> Did something change? I was pretty sure I submitted a patch a while ago for
> boolean kernels, that separated out memory allocation from computation.
> Which should allow for writing to the same memory.  Is this a concern with
> the public Function APIs for the Kernel APIs themselves, or a lower level
> implementation concern?

Yes, you did in the internal implementation [1]. The concern is the
public API and the general approach to implementing new kernels.

I'm working on this right now (it's a large project so it will take me
a little while to produce something to be reviewed) so bear with me =)

[1]: 
https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76

> * Sorting is generally handled by different data processing nodes from
> > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > Projections and Filters use expressions, they do not sort.
>
> Would sorting the list-column elements per row be an array-expr?

Yes, as that's an element-wise function. When I said sorting I was
referring to ORDER BY. The functions we have that do sorting do so in
the context of a single array [2].

A query engine must be able to sort a (potentially very large) stream
of record batches. One approach is for the Sort operator to exhaust
its child input, accumulating all of the record batches in memory
(spilling to disk as needed) and then sorting and emitting record
batches from the sorted records/tuples. See e.g. Impala's sorting code
[3] [4]

[2]: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
[3]: https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
[4]: https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h

>
> On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney  wrote:
>
> > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou  wrote:
> > >
> > >
> > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
> > > >>
> > > >> That said, in the SortToIndices case, this wouldn't be a problem,
> > since
> > > >> only the second pass writes to the output.
> > > >
> > > > This kernel is not valid for normal array-exprs (see the spreadsheet I
> > > > linked), such as what you can write in SQL
> > > >
> > > > Kernels like SortToIndices are a different type of function (in other
> > > > words, "not a SQL function") and so if we choose to allow such a
> > > > "non-SQL-like" functions in the expression evaluator then different
> > > > logic must be used.
> > >
> > > Hmm, I think that maybe I'm misunderstanding at which level we're
> > > talking here.  SortToIndices() may not be a "SQL function", but it looks
> > > like an important basic block for a query engine (since, after all,
> > > sorting results is an often used feature in SQL and other languages).
> > > So it should be usable *inside* the expression engine, even though it's
> > > not part of the exposed vocabulary, no?
> >
> > No, not as part of "expressions" as they are defined in the context of
> > SQL engines.
> >
> > Sorting is generally handled by different data processing nodes from
> > Projections, Aggregations / Hash Aggregations, Filters, and Joins.
> > Projections and Filters use expressions, they do not sort.
> >
> > > Regards
> > >
> > > Antoine.
> >


Re: [C++] Big-endian support

2020-04-22 Thread Wes McKinney
hi Kazuaki

On Wed, Apr 22, 2020 at 12:41 AM Kazuaki Ishizaki  wrote:
>
> Thank you for your comments. I see that the developers would assist of
> other parts, too.
>
> For developing OSS on big-endian, here are resource for an environment and
> CI. They would be helpful for code review, too.
> A trial zLinux VM for OSS development is available. Once we create a VM
> with RHEL or SLES, it is available up to 120 days. The procedure to create
> a VM is available at
> https://github.com/linuxone-community-cloud/technical-resources/blob/master/deploy-virtual-server.md
> .
> Regarding CI, TravisCI on zLinux is available. The article is available at
> https://blog.travis-ci.com/2019-11-12-multi-cpu-architecture-ibm-power-ibm-z

This is good to know. I think we will need you or one of your
colleagues to contribute to the setup and maintenance of this in the
project's CI infrastructure.

>
> Kazuaki Ishizaki,
>
>
>
> From:   Wes McKinney 
> To: dev 
> Date:   2020/04/21 21:11
> Subject:[EXTERNAL] Re: [C++] Big-endian support
>
>
>
> I will add that I think big-endian support would be valuable so that
> the library can be used everywhere, including more exotic mainframe
> type systems like IBM Z.
>
> That said, the code review burden to other C++ developers is likely to
> become significant, so a solo developer with access to big-endian
> hardware submitting pull requests could be problematic since no one
> else with close knowledge of the codebase has a need to support
> big-endian. That said, if big-endian developers would assist with
> other parts of the C++ project as a sort of "quid-pro-quo" to balance
> the time spent on code review relating to big-endian that would be
> helpful.
>
> On Mon, Apr 20, 2020 at 12:38 PM Antoine Pitrou 
> wrote:
> >
> >
> > Hello,
> >
> > Recently some issues have been opened for big-endian support (i.e.
> > support for big-endian *hosts*), and a couple patches submitted, thanks
> > to Kazuaki Ishizaki.  See e.g.:
> >
> > https://issues.apache.org/jira/browse/ARROW-8457
> > https://issues.apache.org/jira/browse/ARROW-8467
> > https://issues.apache.org/jira/browse/ARROW-8486
> > https://issues.apache.org/jira/browse/ARROW-8506
> > https://issues.apache.org/jira/browse/PARQUET-1845
> >
> > Achieving big-endian support across the C++ Arrow and Parquet
> > codebases is likely to be a very significant effort, potentially
> > requiring cooperation between multiple developers.  An additional
> > problem is that, without any Continuous Integration set up, it will be
> > impossible to ensure progress and be notified of regressions.
> >
> > If other people are seriously interested in the desired outcome, they
> > should probably team up with Kazuaki Ishizaki and discuss a practical
> > plan to avoid drowning in the difficulties.
> >
> > Regards
> >
> > Antoine.
>
>
>
>


Re: [C++] Big-endian support

2020-04-22 Thread Wes McKinney
On Wed, Apr 22, 2020 at 12:05 AM Micah Kornfield  wrote:
>
> >
> > That said, if big-endian developers would assist with
> > other parts of the C++ project as a sort of "quid-pro-quo" to balance
> > the time spent on code review relating to big-endian that would be
> > helpful.
>
> I think setting up and maintaining CI would need to be included in
> this; otherwise, even with in-depth reviews, I think it will be easy to
> forget about big-endian architectures.
>
> An additional
> > problem is that, without any Continuous Integration set up, it will be
> > impossible to ensure progress and be notified of regressions.
>
> This might be hijacking the thread, but I think we might have similar
> issues for AVX-512 specific code?

Yes, this is true, but AVX-512-capable machines are significantly less
exotic (I develop on one -- i9-9960X -- for example)

> Thanks,
> Micah
>
>
> On Tue, Apr 21, 2020 at 5:10 AM Wes McKinney  wrote:
>
> > I will add that I think big-endian support would be valuable so that
> > the library can be used everywhere, including more exotic mainframe
> > type systems like IBM Z.
> >
> > That said, the code review burden to other C++ developers is likely to
> > become significant, so a solo developer with access to big-endian
> > hardware submitting pull requests could be problematic since no one
> > else with close knowledge of the codebase has a need to support
> > big-endian. That said, if big-endian developers would assist with
> > other parts of the C++ project as a sort of "quid-pro-quo" to balance
> > the time spent on code review relating to big-endian that would be
> > helpful.
> >
> > On Mon, Apr 20, 2020 at 12:38 PM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Hello,
> > >
> > > Recently some issues have been opened for big-endian support (i.e.
> > > support for big-endian *hosts*), and a couple patches submitted, thanks
> > > to Kazuaki Ishizaki.  See e.g.:
> > >
> > > https://issues.apache.org/jira/browse/ARROW-8457
> > > https://issues.apache.org/jira/browse/ARROW-8467
> > > https://issues.apache.org/jira/browse/ARROW-8486
> > > https://issues.apache.org/jira/browse/ARROW-8506
> > > https://issues.apache.org/jira/browse/PARQUET-1845
> > >
> > > Achieving big-endian support across the C++ Arrow and Parquet
> > > codebases is likely to be a very significant effort, potentially
> > > requiring cooperation between multiple developers.  An additional
> > > problem is that, without any Continuous Integration set up, it will be
> > > impossible to ensure progress and be notified of regressions.
> > >
> > > If other people are seriously interested in the desired outcome, they
> > > should probably team up with Kazuaki Ishizaki and discuss a practical
> > > plan to avoid drowning in the difficulties.
> > >
> > > Regards
> > >
> > > Antoine.
> >


Re: [DISCUSS] Reducing scope of work for Arrow 1.0.0 release

2020-04-22 Thread Wes McKinney
hi Micah,

I'm not saying that I think the work definitely will not be completed,
but rather that we should put a date on the calendar as the target
date for 1.0.0 and stick to it. If the work gets done, that's great.

10 to 12 weeks from now would mean releasing 1.0.0 either the week of
June 29 or July 6. That is about 1 year since we discussed and adopted
our SemVer policy [1]

> I would propose that if there isn't an implementation in any language we
> might drop it as part of the specification.  The main feature that I think
> meets this criteria is the Dictionary of Dictionary columns (Is this
> supported in C++)?

I don't have a strong view on this, but IIUC this is implemented in
JavaScript and probably not far off in C++.

- Wes

[1]: 
https://lists.apache.org/thread.html/2a630234214e590eb184c24bbf9dac4a8d8f7677d85a75fa49d70ba8%40%3Cdev.arrow.apache.org%3E

On Wed, Apr 22, 2020 at 12:26 AM Micah Kornfield  wrote:
>
> Hi Wes,
> I think we might be closer than we think on the Java side to having the
> functionality listed (I've added comments inline at the end with the
> features you listed in the original e-mail).
>
> My biggest concern is I don't think there is a clear path forward for
> Sparse Unions.  Getting compatibility for Sparse unions would be more
> invasive/breaking changes to the java code base.  [1] is the last thread on
> the issue.  I sadly have not had time to get back to this, nor will I
> probably have time before the next release.
>
> I would propose that if there isn't an implementation in any language we
> might drop it as part of the specification.  The main feature that I think
> meets this criteria is the Dictionary of Dictionary columns (Is this
> supported in C++)?
>
> Thanks,
> Micah
>
>
> * custom_metadata fields
>
> Not sure about this one.
>
> > * Extension Types
>
> There is an implementation already in Java; it probably needs more work for
> integration testing.
>
> * Large (64-bit offset) variable size types
>
> there is an open PR for string/binary types.  LargeList is of more
> questionable value until Java supports vectors/arrays with more than 2^32
> elements.
>
> * Delta and Replacement Dictionaries
>
> There is an implementation already in Java; it probably needs more work
> specifically for integration testing.
>
> > * Unions
>
> There is an implementation for dense unions (likely needs more work for
> integration testing).
>
> On Tue, Apr 21, 2020 at 11:26 AM Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
>
> > I'm all for making our next release be 1.0. Everything is about tradeoffs,
> > and while I too would like to see a complete Java implementation, I think
> > the costs of further delaying 1.0 outweigh the benefits of holding it
> > indefinitely in hopes that there will be enough availability of Java
> > developers to finish integration testing.
> >
> > Neal
> >
> > On Tue, Apr 21, 2020 at 10:55 AM Wes McKinney  wrote:
> >
> > > hi Bryan -- with the way that things are going, if we were to block
> > > the 1.0.0 release on completing the Java work, it could be a very long
> > > time to wait (long time = more than 6 months from now). I don't think
> > > that's acceptable. The Versioning document was formally adopted last
> > > August and so a year will have soon elapsed since we previously said
> > > we wanted to have everything integration tested.
> > >
> > > With what I'm proposing the primary things that would not be tested
> > > (if no progress in Java):
> > >
> > > * custom_metadata fields
> > > * Extension Types
> > > * Large (64-bit offset) variable size types
> > > * Delta and Replacement Dictionaries
> > > * Unions
> > >
> > > These do not seem like huge sacrifices, or at least not ones that
> > > compromise the stability of the columnar format. Of course, if some of
> > > them are completed in the next 10-12 weeks, then that's great.
> > >
> > > - Wes
> > >
> > > On Tue, Apr 21, 2020 at 12:12 PM Bryan Cutler  wrote:
> > > >
> > > > I really would like to see a 1.0.0 release with complete
> > implementations
> > > > for C++ and Java. From my experience, that interoperability has been a
> > > > major selling point for the project. That being said, my time for
> > > > contributions has been pretty limited lately and I know that Java has
> > > been
> > > > lagging, so if the rest of the community would like to push forward
> > with
> > > a
> > > > reduced scope, that is okay with me. I'll still continue to do what I
> > can
> > > > on Java to fill in the gaps.
> > > >
> > > > Bryan
> > > >
> > > > On Tue, Apr 21, 2020 at 8:47 AM Wes McKinney 
> > > wrote:
> > > >
> > > > > Hi all -- are there some opinions about this?
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Thu, Apr 16, 2020 at 5:30 PM Wes McKinney 
> > > wrote:
> > > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > Previously we had discussed a plan for making a 1.0.0 release based
> > > on
> > > > > > completeness of columnar format integration tests and making
> > > > > > 

[jira] [Created] (ARROW-8553) [C++] Reimplement BitmapAnd using Bitmap::VisitWords

2020-04-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8553:
-

 Summary: [C++] Reimplement BitmapAnd using Bitmap::VisitWords
 Key: ARROW-8553
 URL: https://issues.apache.org/jira/browse/ARROW-8553
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.17.0
Reporter: Antoine Pitrou


Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using 
{{Bitmap::VisitWords}} instead would probably yield a manyfold performance 
increase.





[NIGHTLY] Arrow Build Report for Job nightly-2020-04-22-0

2020-04-22 Thread Crossbow


Arrow Build Report for Job nightly-2020-04-22-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0

Failed Tasks:
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-turbodbc-latest

Succeeded Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-centos-6-amd64
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-centos-8-amd64
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-debian-buster-amd64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-debian-stretch-amd64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-github-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-pandas-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-azure-test-conda-python-3.7
- test-conda-python-3.8-dask-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.8-dask-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-22-0-circle-test-conda-python-3.8-jpype
-