Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Uwe L. Korn
+1 (binding)

On Fri, Mar 1, 2024, at 2:37 PM, Andy Grove wrote:
> +1 (binding)
>
> On Fri, Mar 1, 2024 at 6:20 AM Weston Pace  wrote:
>
>> +1 (binding)
>>
>> On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb  wrote:
>>
>> > Hello,
>> >
>> > As we have discussed[1][2] I would like to vote on the proposal to
>> > create a new Apache Top Level Project for DataFusion. The text of the
>> > proposed resolution and background document is copy/pasted below
>> >
>> > If the community is in favor of this, we plan to submit the resolution
>> > to the ASF board for approval with the next Arrow report (for the
>> > April 2024 board meeting).
>> >
>> > The vote will be open for at least 7 days.
>> >
>> > [ ] +1 Accept this Proposal
>> > [ ] +0
>> > [ ] -1 Do not accept this proposal because...
>> >
>> > Andrew
>> >
>> > [1] https://lists.apache.org/thread/c150t1s1x0kcb3r03cjyx31kqs5oc341
>> > [2] https://github.com/apache/arrow-datafusion/discussions/6475
>> >
>> > -- Proposed Resolution -
>> >
>> > Resolution to Create the Apache DataFusion Project from the Apache
>> > Arrow DataFusion Sub Project
>> >
>> > =
>> >
>> > X. Establish the Apache DataFusion Project
>> >
>> > WHEREAS, the Board of Directors deems it to be in the best
>> > interests of the Foundation and consistent with the
>> > Foundation's purpose to establish a Project Management
>> > Committee charged with the creation and maintenance of
>> > open-source software related to an extensible query engine
>> > for distribution at no charge to the public.
>> >
>> > NOW, THEREFORE, BE IT RESOLVED, that a Project Management
>> > Committee (PMC), to be known as the "Apache DataFusion Project",
>> > be and hereby is established pursuant to Bylaws of the
>> > Foundation; and be it further
>> >
>> > RESOLVED, that the Apache DataFusion Project be and hereby is
>> > responsible for the creation and maintenance of software
>> > related to an extensible query engine; and be it further
>> >
>> > RESOLVED, that the office of "Vice President, Apache DataFusion" be
>> > and hereby is created, the person holding such office to
>> > serve at the direction of the Board of Directors as the chair
>> > of the Apache DataFusion Project, and to have primary responsibility
>> > for management of the projects within the scope of
>> > responsibility of the Apache DataFusion Project; and be it further
>> >
>> > RESOLVED, that the persons listed immediately below be and
>> > hereby are appointed to serve as the initial members of the
>> > Apache DataFusion Project:
>> >
>> > * Andy Grove (agr...@apache.org)
>> > * Andrew Lamb (al...@apache.org)
>> > * Daniël Heres (dhe...@apache.org)
>> > * Jie Wen (jake...@apache.org)
>> > * Kun Liu (liu...@apache.org)
>> > * Liang-Chi Hsieh (vii...@apache.org)
>> > * Qingping Hou: (ho...@apache.org)
>> > * Wes McKinney(w...@apache.org)
>> > * Will Jones (wjones...@apache.org)
>> >
>> > RESOLVED, that the Apache DataFusion Project be and hereby
>> > is tasked with the migration and rationalization of the Apache
>> > Arrow DataFusion sub-project; and be it further
>> >
>> > RESOLVED, that all responsibilities pertaining to the Apache
>> > Arrow DataFusion sub-project encumbered upon the
>> > Apache Arrow Project are hereafter discharged.
>> >
>> > NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrew Lamb
>> > be appointed to the office of Vice President, Apache DataFusion, to
>> > serve in accordance with and subject to the direction of the
>> > Board of Directors and the Bylaws of the Foundation until
>> > death, resignation, retirement, removal or disqualification,
>> > or until a successor is appointed.
>> > =
>> >
>> >
>> > ---
>> >
>> >
>> > Summary:
>> >
>> > We propose creating a new top level project, Apache DataFusion, from
>> > an existing sub project of Apache Arrow to facilitate additional
>> > community and project growth.
>> >
>> > Abstract
>> >
>> > Apache Arrow DataFusion[1]  is a very fast, extensible query engine
>> > for building high-quality data-centric systems in Rust, using the
>> > Apache Arrow in-memory format. DataFusion offers SQL and Dataframe
>> > APIs, excellent performance, built-in support for CSV, Parquet, JSON,
>> > and Avro, extensive customization, and a great community.
>> >
>> > [1] https://arrow.apache.org/datafusion/
>> >
>> >
>> > Proposal
>> >
>> > We propose creating a new top level ASF project, Apache DataFusion,
>> > governed initially by a subset of the Apache Arrow project’s PMC and
>> > committers. The project’s code is in five existing git repositories,
>> > currently governed by Apache Arrow which would transfer to the new top
>> > level project.
>> >
>> > Background
>> >
>> > When DataFusion was initially donated to the Arrow project, it did not
>> > have a strong enough community to stand on its own. It has since grown
>> > significantly, and benefited immensely from 

Re: [VOTE] Move issue tracking to GitHub Issues

2022-10-27 Thread Uwe L. Korn
+1

On Thu, Oct 27, 2022, at 5:13 PM, Nic wrote:
> +1
>
> On Thu, 27 Oct 2022 at 14:00, Alenka Frim 
> wrote:
>
>> +1
>>
>> On Thu, Oct 27, 2022 at 2:36 PM prem sagar gali 
>> wrote:
>>
>> > +1
>> >
>> > On Thu, Oct 27, 2022 at 7:13 AM Dewey Dunnington
>> >  wrote:
>> >
>> > > +1 (non-binding)!
>> > >
>> > > On Thu, Oct 27, 2022 at 8:54 AM Eric Hanson 
>> > > wrote:
>> > >
>> > > > +1
>> > > >
>> > > > On 2022/10/26 23:02:33 Neal Richardson wrote:
>> > > > > I propose that we move issue tracking from the ASF's Jira to GitHub
>> > > > Issues.
>> > > > > This has been discussed on [1] and [2] and there seems to be
>> > > consensus. A
>> > > > > number of Arrow subprojects already use GitHub Issues; this moves
>> the
>> > > > issue
>> > > > > tracking for `apache/arrow` into GitHub along with the source code.
>> > > > >
>> > > > > The vote will be open for at least 72 hours.
>> > > > >
>> > > > > [ ] +1 Leave ASF Jira and move to GitHub Issues
>> > > > > [ ] +0
>> > > > > [ ] -1 Remain in Jira because...
>> > > > >
>> > > > > My vote: +1
>> > > > >
>> > > > > Neal
>> > > > >
>> > > > >
>> > > > > [1]:
>> > https://lists.apache.org/thread/l545m95xmf3w47oxwqxvg811or7b93tb
>> > > > > [2]:
>> > https://lists.apache.org/thread/0vwj8gdo55jly5zn16wksrotyqqm0zqr
>> > > > >
>> > > >
>> > >
>> >
>>


Re: [Discuss][Python] Stop publishing universal wheels?

2022-10-27 Thread Uwe L. Korn
Hello,

if we have wheels for x86_64 and arm64 individually, I don't see an argument 
for keeping universal2 ones. x86_64 Macs will probably stay around for a while, 
as Apple is quite good at keeping old hardware updated, and the laptops 
themselves are pretty solid.

Best
Uwe

On Thu, Oct 27, 2022, at 10:04 AM, Antoine Pitrou wrote:
> Hello,
>
> Currently, for macOS we're publishing both arm64, x86_64 *and* 
> universal2 binary wheels (the latter contain both arm64 and x86_64 code 
> in a single binary).
>
> Here are some observations from me:
>
> * Producing universal2 wheels is more complex than producing 
> single-architecture wheels (we actually have to build for both 
> architectures separately, then merge the results); it's also one more 
> CI/packaging configuration to look after
>
> * x86-64 Macs are legacy and are gradually disappearing (especially for 
> high-performance applications where ARM Macs are massively faster)
>
> * Numpy publishes only architecture-specific wheels, while Pandas 
> publishes both architecture-specific and universal wheels
>
> * Size-wise, a universal wheel is not much smaller than the sum of 
> architecture-specific wheels (for example, 43.7 MB for
> pyarrow-9.0.0-cp310-cp310, vs. 24.0 MB + 21.6 MB)
>
> Is there any reason why we should continue publishing universal wheels 
> for macOS?
>
> Regards
>
> Antoine.
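
For illustration, a minimal sketch (assuming a macOS machine with the `lipo` 
tool available) of checking which architectures the native libraries inside a 
wheel actually contain; the wheel filename at the bottom is only an example:

import subprocess
import tempfile
import zipfile
from pathlib import Path

def wheel_architectures(wheel_path):
    """Map each native library in the wheel to the architectures it contains."""
    archs = {}
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(wheel_path) as wheel:
            wheel.extractall(tmp)
        for lib in Path(tmp).rglob("*"):
            if lib.suffix not in (".so", ".dylib"):
                continue
            # `lipo -archs` prints the architectures of a (possibly fat) binary
            out = subprocess.run(["lipo", "-archs", str(lib)],
                                 capture_output=True, text=True, check=True)
            archs[lib.name] = out.stdout.split()
    return archs

# A universal2 wheel should report ['x86_64', 'arm64'] for each library,
# an architecture-specific wheel only a single entry.
print(wheel_architectures("pyarrow-9.0.0-cp310-cp310-macosx_11_0_universal2.whl"))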


Re: [VOTE] Release Apache Arrow 7.0.0 - RC6

2022-01-25 Thread Uwe L. Korn
Hello all,

I sadly hit an issue compiling with GCC 7.5 at the moment, as reported in 
https://issues.apache.org/jira/browse/ARROW-15444

We need this version to support CUDA-enabled and ppc64le builds on conda-forge.

Cheers
Uwe

On Tue, Jan 25, 2022, at 10:35 AM, Krisztián Szűcs wrote:
> Thanks David for verifying!
>
> On Tue, Jan 25, 2022 at 1:07 AM David Li  wrote:
>>
>> Unfortunately I get a few errors during source verification of Python due to 
>> flaky tests.
>>
>> I filed https://issues.apache.org/jira/browse/ARROW-15437 and 
>> https://issues.apache.org/jira/browse/ARROW-15438
> Luckily these shouldn't be blocker issues.
>>
>> I'll try to have a patch for the former later tonight.
>>
>> Also I had to uninstall gdb before doing verification as the GDB tests fail 
>> for me due to not being able to find libarrow.so.
> Created a jira https://issues.apache.org/jira/browse/ARROW-15442
>
> Meanwhile you can disable the GDB tests by `export PYARROW_TEST_GDB=OFF`
>>
>> -David
>>
>> On Mon, Jan 24, 2022, at 13:02, Krisztián Szűcs wrote:
>> > Hi,
>> >
>> > I would like to propose the following release candidate (RC6) of Apache
>> > Arrow version 7.0.0. This is a release consisting of 608
>> > resolved JIRA issues[1].
>> >
>> > This release candidate is based on commit:
>> > cc809bd98a04f562a38107858cab669db0768cc1 [2]
>> >
>> > The source release rc6 is hosted at [3].
>> > The binary artifacts are hosted at [4][5][6][7][8][9][10][11].
>> > The changelog is located at [12].
>> >
>> > Please download, verify checksums and signatures, run the unit tests,
>> > and vote on the release. See [13] for how to validate a release candidate.
>> >
>> > The vote will be open for at least 72 hours.
>> >
>> > [ ] +1 Release this as Apache Arrow 7.0.0
>> > [ ] +0
>> > [ ] -1 Do not release this as Apache Arrow 7.0.0 because...
>> >
>> > [1]: 
>> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%207.0.0
>> > [2]: 
>> > https://github.com/apache/arrow/tree/cc809bd98a04f562a38107858cab669db0768cc1
>> > [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-7.0.0-rc6
>> > [4]: https://apache.jfrog.io/artifactory/arrow/almalinux-rc/
>> > [5]: https://apache.jfrog.io/artifactory/arrow/amazon-linux-rc/
>> > [6]: https://apache.jfrog.io/artifactory/arrow/centos-rc/
>> > [7]: https://apache.jfrog.io/artifactory/arrow/debian-rc/
>> > [8]: https://apache.jfrog.io/artifactory/arrow/java-rc/7.0.0-rc6
>> > [9]: https://apache.jfrog.io/artifactory/arrow/nuget-rc/7.0.0-rc6
>> > [10]: https://apache.jfrog.io/artifactory/arrow/python-rc/7.0.0-rc6
>> > [11]: https://apache.jfrog.io/artifactory/arrow/ubuntu-rc/
>> > [12]: 
>> > https://github.com/apache/arrow/blob/cc809bd98a04f562a38107858cab669db0768cc1/CHANGELOG.md
>> > [13]: 
>> > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
>> >


Re: [DISCUSS] Dropping support for Visual Studio 2015

2021-08-14 Thread Uwe L. Korn
+1 

VS2017 should also be compatible with VS2015, so this shouldn't cause any 
issues for downstream users that link dynamically.

> On 14.08.2021 at 01:56, Benjamin Kietzman wrote:
> 
> Thanks for commenting, all. I'll open a JIRA/PR to remove support next week.
> 
>> On Tue, Aug 10, 2021, 09:34 Wes McKinney  wrote:
>> 
>> +1 for dropping it also.
>> 
>>> On Mon, Aug 9, 2021 at 7:03 PM Keith Kraus 
>>> wrote:
>>> 
>>> +1 as well. Is there any build platforms that we're currently supporting
>>> that still use vs2015?
>>> 
>>> Conda-forge did its migration ~1.5 years ago:
>>> https://github.com/conda-forge/conda-forge-pinning-feedstock/pull/501.
>>> 
>>> -Keith
>>> 
>>> On Mon, Aug 9, 2021 at 12:01 PM Antoine Pitrou 
>> wrote:
>>> 
 
 +1 for requiring a more recent MSVC version.
 
 Regards
 
 Antoine.
 
 
 On 09/08/2021 at 17:38, Benjamin Kietzman wrote:
> MSVC 19.0 is buggy enough that I for one have spent multiple days
> reworking code that is fine for all other compilers we test against.
> Most recently in the context of
 https://github.com/apache/arrow/pull/10793
> (ARROW-13482) I found that for some types T,
> `std::is_convertible::value` will be false. This necessitated
>> the
> following
> (very hacky) workaround:
> 
> 
 
>> https://github.com/apache/arrow/pull/10793/commits/c44be29686af6fab2132097aa3cbd430d6ac71fe
> 
> (Side note: if anybody has a better solution than that specific
 hack,
>  please don't hesitate to comment on the PR.)
> 
> Would it be allowable for us to drop support for this compiler? IIUC
> Microsoft is no longer accepting feedback/bug reports for VS2017, let
> alone VS2015. Are there any users who depend on libarrow building
> with that compiler?
> 
 
>> 



Re: [RESULT] [VOTE] Release Apache Arrow 3.0.0 - RC2

2021-01-27 Thread Uwe L. Korn
1.  [done] rebase master
2.  [done] upload source
3.  [done] upload binaries
4.  [done] update website
5.  [done] upload ruby gems
6.  [done] upload js packages
8.  [done] upload C# packages
9.  [done] upload rust crates
10. [done] update conda recipes
11. [done] upload wheels/sdist to pypi
12. [ ] update homebrew packages
https://github.com/Homebrew/homebrew-core/pull/69125 may
include this.
13. [done] update maven artifacts
14. [done] update msys2
15. [nealrichardson] update R packages
16. [done] update docs
17. [ ] rebase the pull requests

On Wed, Jan 27, 2021, at 12:14 AM, Neal Richardson wrote:
> R package has been submitted to CRAN.
> 
> I've also started a release announcement blog post:
> https://github.com/apache/arrow-site/pull/92
> 
> Please help fill in the sections for the various languages in the project.
> 
> Neal
> 
> 
> On Tue, Jan 26, 2021 at 12:30 AM Sutou Kouhei  wrote:
> 
> > 1.  [done] rebase master
> > 2.  [done] upload source
> > 3.  [done] upload binaries
> > 4.  [done] update website
> > 5.  [done] upload ruby gems
> > 6.  [done] upload js packages
> > 8.  [done] upload C# packages
> > 9.  [done] upload rust crates
> > 10. [xhochy] update conda recipes
> > 11. [done] upload wheels/sdist to pypi
> > 12. [ ] update homebrew packages
> > https://github.com/Homebrew/homebrew-core/pull/69125 may
> > include this.
> > 13. [done] update maven artifacts
> > 14. [done] update msys2
> > 15. [nealrichardson] update R packages
> > 16. [done] update docs
> > 17. [ ] rebase the pull requests
> >
> > In 
> >   "Re: [RESULT] [VOTE] Release Apache Arrow 3.0.0 - RC2" on Tue, 26 Jan
> > 2021 02:15:02 +0100,
> >   Krisztián Szűcs  wrote:
> >
> > > On Tue, Jan 26, 2021 at 1:57 AM Andy Grove 
> > wrote:
> > >>
> > >> The Rust crates (arrow, arrow-flight, parquet, parquet_derive,
> > datafusion)
> > >> have now been published to crates.io
> > > Thanks! Did everything go fluently? We usually have issues with the
> > > versions defined in the cargo files.
> > >
> > > I uploaded the python wheels:
> > > 1.  [done] rebase master
> > > 2.  [done] upload source
> > > 3.  [done] upload binaries
> > > 4.  [done] update website
> > > 5.  [done] upload ruby gems
> > > 6.  [done] upload js packages
> > > 8.  [done] upload C# packages
> > > 9.  [done] upload rust crates
> > > 10. [xhochy] update conda recipes
> > > 11. [done] upload wheels/sdist to pypi
> > > 12. [ ] update homebrew packages
> > > 13. [done] update maven artifacts
> > > 14. [kou] update msys2
> > >>
> > >> On Mon, Jan 25, 2021 at 5:29 PM Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>
> > >> wrote:
> > >>
> > >> > On Tue, Jan 26, 2021 at 1:18 AM Andy Grove 
> > wrote:
> > >> > >
> > >> > > Thanks, Krisztián.
> > >> > >
> > >> > > I can release the Rust crates this evening (in the next hour or so).
> > >> > Thank you Andy!
> > >> >
> > >> > The current status:
> > >> > 1.  [done] rebase master
> > >> > 2.  [done] upload source
> > >> > 3.  [done] upload binaries
> > >> > 4.  [done] update website
> > >> > 5.  [done] upload ruby gems
> > >> > 6.  [done] upload js packages
> > >> > 8.  [done] upload C# packages
> > >> > 9.  [andygrove] upload rust crates
> > >> > 10. [xhochy] update conda recipes
> > >> > 11. [kszucs] upload wheels/sdist to pypi
> > >> > 12. [ ] update homebrew packages
> > >> > 13. [done] update maven artifacts
> > >> > 14. [kou] update msys2
> > >> > 15. [nealrichardson] update R packages
> > >> > 16. [ ] update docs
> > >> > 17. [ ] rebase the pull requests
> > >> >
> > >> > >
> > >> > > On Mon, Jan 25, 2021 at 4:17 PM Krisztián Szűcs <
> > >> > szucs.kriszt...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Current status of the post release tasks:
> > >> > > >
> > >> > > > 1.  [done] rebase master
> > >> > > > 2.  [done] upload source
> > >> > > > 3.  [done] upload binaries
> > >> > > > 4.  [ ] update website
> > >> > > > 5.  [done] upload ruby gems
> > >> > > > 6.  [ ] upload js packages
> > >> > > > 8.  [done] upload C# packages
> > >> > > > 9.  [ ] upload rust crates
> > >> > > > 10. [ ] update conda recipes
> > >> > > > 11. [kszucs] upload wheels/sdist to pypi
> > >> > > > 12. [ ] update homebrew packages
> > >> > > > 13. [kszucs] update maven artifacts
> > >> > > > 14. [ ] update msys2
> > >> > > > 15. [nealrichardson] update R packages
> > >> > > > 16. [ ] update docs
> > >> > > > 17. [ ] rebase the pull requests
> > >> > > >
> > >> > > > On Mon, Jan 25, 2021 at 10:38 PM Wes McKinney <
> > wesmck...@gmail.com>
> > >> > wrote:
> > >> > > > >
> > >> > > > > Copying the vote result with the usual subject line
> > >> > > > >
> > >> > > > > On Mon, Jan 25, 2021 at 3:23 PM Krisztián Szűcs
> > >> > > > >  wrote:
> > >> > > > > >
> > >> > > > > > The VOTE carries with
> > >> > > > > > - 4 binding +1
> > >> > > > > > - 3 non-binding +1
> > >> > > > > > - 1 non-binding +0
> > >> > > > > > votes.
> > >> > > > > >
> > >> > > > > > Thanks everyone!
> > >> > > > > >
> > >> > > > > > I'm starting the post release 

Re: [VOTE] Release Apache Arrow 3.0.0 - RC2

2021-01-25 Thread Uwe L. Korn
+1 (binding)

Verified C++, Python and Rust on the Apple M1 (natively!) and everything works. 
I had to make some slight modifications to the verification script, but they are 
independent of the source tarball: https://github.com/apache/arrow/pull/9315

Cheers
Uwe

On Fri, Jan 22, 2021, at 4:59 PM, Neal Richardson wrote:
> Crossbow does (at least some) Windows verification, and it appears that it
> passed: https://github.com/apache/arrow/pull/9245
> 
> On Fri, Jan 22, 2021 at 5:46 AM Neville Dipale 
> wrote:
> 
> > Hi Krisztian,
> >
> > The full output is at
> > https://gist.github.com/nevi-me/88a6279dd90aea30aa4caaa15fb0cc53
> >
> > I also ran dev/release/verify-release-candidate-wheels.bat 3.0.0 2
> >
> > Getting the below error, it seems to be a Python 3.7 bug, but I'm not yet
> > finding a solution for it online.
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>pip install
> > pyarrow-3.0.0-cp36-cp36m-win_amd64.whl   || EXIT /B 1
> > Failed to import the site module
> > Traceback (most recent call last):
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 579, in 
> > main()
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 566, in main
> > known_paths = addsitepackages(known_paths)
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 349, in addsitepackages
> > addsitedir(sitedir, known_paths)
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 207, in addsitedir
> > addpackage(sitedir, name, known_paths)
> >   File "C:\Users\nevi\anaconda3\lib\site.py", line 159, in addpackage
> > f = open(fullname, "r")
> >   File "C:\Users\nevi\anaconda3\lib\_bootlocale.py", line 12, in
> > getpreferredencoding
> > if sys.flags.utf8_mode:
> > AttributeError: 'sys.flags' object has no attribute 'utf8_mode'
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>if errorlevel 1 GOTO error
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>call deactivate
> > DeprecationWarning: 'deactivate' is deprecated. Use 'conda deactivate'.
> >
> > (C:\tmp\arrow-verify-release-wheels\_verify-wheel-3.6)
> > C:\tmp\arrow-verify-release-wheels>conda.bat deactivate
> >
> > C:\tmp\arrow-verify-release-wheels>cd C:\Users\nevi\Work\oss\arrow
> >
> > C:\Users\nevi\Work\oss\arrow>EXIT /B 1
> >
> > On Fri, 22 Jan 2021 at 15:14, Krisztián Szűcs 
> > wrote:
> >
> > > Thanks Neville for testing it!
> > >
> > > There should be more context about the failures above the summary.
> > > Could you please post the errors?
> > >
> > >
> > > On Fri, Jan 22, 2021 at 2:05 PM Neville Dipale 
> > > wrote:
> > > >
> > > > (+0 non-binding)
> > > >
> > > > Getting test failures (see end of my mail).
> > > >
> > > > This is my first time verifying (Windows 10; Insider Preview if
> > > relevant),
> > > > so I'm
> > > > likely missing something in my config. I'll read the verification
> > script
> > > > and try again.
> > > >
> > > > I ran the below using PowerShell:
> > > >
> > > > $env:ARROW_GANDIVA=0; $env:ARROW_PLASMA=0; $env:TEST_DEFAULT=0;
> > > > $env:TEST_SOURCE=1; $env:TEST_CPP=1; $env:TEST_PYTHON=1;
> > > > $env:TEST_JAVA=1; $env:TEST_INTEGRATION_CPP=1;
> > > > $env:TEST_INTEGRATION_JAVA=1;
> > > > ./dev/release/verify-release-candidate.bat 3.0.0 2
> > > >
> > > > I had to change cmake generator to use the below (in line 53):
> > > >
> > > > set GENERATOR=Visual Studio 16 2019
> > > >
> > > > VS 2017 wasn't working for me, even after installing its build tools.
> > > >
> > > > 
> > > >
> > > > I have 3 test failures per below:
> > > >
> > > > 94% tests passed, 3 tests failed out of 52
> > > >
> > > > Label Time Summary:
> > > > arrow-tests   =  66.55 sec*proc (30 tests)
> > > > arrow_compute =   3.38 sec*proc (4 tests)
> > > > arrow_dataset =   0.89 sec*proc (9 tests)
> > > > arrow_flight  =   0.92 sec*proc (1 test)
> > > > arrow_python-tests=   0.45 sec*proc (1 test)
> > > > filesystem=   6.36 sec*proc (2 tests)
> > > > parquet-tests =   7.22 sec*proc (7 tests)
> > > > unittest  =  79.41 sec*proc (52 tests)
> > > >
> > > > Total Test time (real) =  80.27 sec
> > > >
> > > > The following tests FAILED:
> > > >  45 - arrow-python-test (Failed)
> > > >  46 - parquet-internals-test (Failed)
> > > >  49 - parquet-arrow-test (Failed)
> > > > Errors while running CTest
> > > >
> > > > I don't know if this has any significance, I got the errors from 2
> > runs.
> > > >
> > > > Neville
> > > >
> > > > On Fri, 22 Jan 2021 at 14:22, Neville Dipale 
> > > wrote:
> > > >
> > > > > This is my first time verifying, do I also need to set the env vars
> > > below?
> > > > >
> > > > > ARROW_GANDIVA=0 ARROW_PLASMA=0 TEST_DEFAULT=0
> > > > > TEST_SOURCE=1 TEST_CPP=1 TEST_PYTHON=1 TEST_JAVA=1
> > > > > TEST_INTEGRATION_CPP=1 TEST_INTEGRATION_JAVA=1
> > > > >
> > > > > Otherwise, I'm currently 

Re: Incompatibility of all existing pyarrow releases with the next NumPy release

2020-12-04 Thread Uwe L. Korn
NumPy's deprecation policy would drop support for the 1.16 series in January: 
https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table That is 
when I would suggest raising the minimal NumPy in the builds here to 1.17, and 
we will also raise the version used for builds in conda-forge.

Still, the PR is so trivial that we should merge it. I'm not up to date on the 
status of the 2.0.1 release, but this would be an essential patch for it.

On Fri, Dec 4, 2020, at 9:22 PM, Antoine Pitrou wrote:
> 
> 
> On 04/12/2020 at 21:11, Uwe L. Korn wrote:
> > Hello all,
> > 
> > Today the Kartothek CI turned quite red in 
> > https://github.com/JDASoftwareGroup/kartothek/pull/383 / 
> > https://github.com/JDASoftwareGroup/kartothek/pull/383/checks?check_run_id=1497941813
> >  as the new NumPy 1.20rc1 was pulled in. It simply broke all 
> > pyarrow<->NumPy interop as now dtypes returned by numpy are actual 
> > subclasses not directly numpy.dtype instances anymore. I reported the issue 
> > over at https://github.com/numpy/numpy/issues/17913. We are running into 
> > that as we build our wheels and conda packages with an older release of 
> > NumPy that has a faulty implementation of PyArray_DescrCheck.
> > 
> >  (a) For upcoming releases, we can either move our minimal supported NumPy 
> > to 1.16.6 or merge the PR over at https://github.com/apache/arrow/pull/8834
> 
> I would be fine with merging the PR (assuming comments are added to
> explain why things are done that way).  Apparently Numpy 1.16.6 is only
> one year old.
> 
> Regards
> 
> Antoine.
>


Incompatibility of all existing pyarrow releases with the next NumPy release

2020-12-04 Thread Uwe L. Korn
Hello all,

Today the Kartothek CI turned quite red in 
https://github.com/JDASoftwareGroup/kartothek/pull/383 / 
https://github.com/JDASoftwareGroup/kartothek/pull/383/checks?check_run_id=1497941813
 as the new NumPy 1.20rc1 was pulled in. It simply broke all pyarrow<->NumPy 
interop as now dtypes returned by numpy are actual subclasses not directly 
numpy.dtype instances anymore. I reported the issue over at 
https://github.com/numpy/numpy/issues/17913. We are running into that as we 
build our wheels and conda packages with an older release of NumPy that has a 
faulty implementation of PyArray_DescrCheck.

 (a) For upcoming releases, we can either move our minimal supported NumPy to 
1.16.6 or merge the PR over at https://github.com/apache/arrow/pull/8834
 (b) Existing conda(-forge) packages can get a repodata patch that adds a 
numpy<1.20 constraint to them
 (c) I'll rebuild the latest but still frequently used pyarrow releases on 
conda-forge using numpy 1.16.6
 (d) Old pyarrow wheels (Python<3.8), though, won't be easily fixed and instead 
will return the confusing "ArrowTypeError: Did not pass numpy.dtype object" 
error message. Personally, my approach here would be to not do anything and 
simply direct users to downgrade NumPy if they run into the issue.

Is anyone objecting to this approach?

Cheers
Uwe
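
For reference, a minimal sketch of the dtype behaviour change described above; 
run it once against a NumPy >= 1.20 release candidate and once against an older 
release to see the difference:

import numpy as np

dt = np.dtype("int64")

# On NumPy >= 1.20 the concrete dtype is an instance of a per-type subclass of
# numpy.dtype, so an exact-type comparison (roughly what the faulty
# PyArray_DescrCheck mentioned above boils down to) stops matching:
print(type(dt))                  # e.g. <class 'numpy.dtype[int64]'> on >= 1.20
print(type(dt) is np.dtype)      # False on NumPy >= 1.20, True on older releases

# A subclass-aware check keeps working on both old and new NumPy:
print(isinstance(dt, np.dtype))  # True everywhere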


Re: Removing Python 3.5 support

2020-11-26 Thread Uwe L. Korn
+1 from my side too

On Thu, Nov 26, 2020, at 1:04 PM, Joris Van den Bossche wrote:
> +1 on dropping Python 3.5
> 
> On Thu, 26 Nov 2020 at 12:26, Antoine Pitrou  wrote:
> 
> >
> > Hello,
> >
> > Python 3.5 is not supported upstream, neither by the CPython development
> > team nor by third-party projects such as NumPy.
> >
> > Our CI for Python 3.5 has just started failing because a dependency of
> > pytest doesn't support it anymore:
> > https://github.com/apache/arrow/runs/1458449054?check_suite_focus=true
> >
> > Unless someone objects very strongly (and pledges to provide the
> > required maintenance effort), it seems we finally need to drop Python
> > 3.5 support before we release Arrow 3.0.0.
> >
> > See also https://issues.apache.org/jira/browse/ARROW-5679
> >
> > Regards
> >
> > Antoine.
> >
>


Re: ursa-labs/crossbow on travis-ci.com is disabled

2020-11-26 Thread Uwe L. Korn
Also note that drone.io supports linux-arm64 which we use in conda-forge for 
this architecture and is already setup in crossbow (although we had issues with 
branches not being seen).

On Thu, Nov 26, 2020, at 1:31 AM, Jeroen Ooms wrote:
> On Wed, Nov 25, 2020 at 10:54 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > In 
> >   "Re: ursa-labs/crossbow on travis-ci.com is disabled" on Tue, 24 Nov 2020 
> > 13:36:54 +0100,
> >   Krisztián Szűcs  wrote:
> >
> > > Confirmed, we already have a negative credit balance due to travis'
> > > new billing strategy.
> > > The macos wheels quickly consume the credit based free tier, so travis
> > > disables even the linux builds.
> > >
> > > I think we should migrate away from travis to gha or azure, drawbacks:
> > > - the wheel scripts are tailored for travis
> > > - only amd64 arch
> >
> > Thanks for confirming it.
> >
> > It seems that we have 10,000 credits per month. Our
> > travis-ci.com jobs will be enabled again after 4 days. If we
> > reduce our travis-ci.com usage, we may be able to keep using
> > travis.ci.com.
> 
> I found that if you contact their support, they may be willing to
> donate much more free credits for selected open source projects. They
> already mention Apache (the httpd) as an example in their blog of
> important OSS: https://blog.travis-ci.com/oss-announcement
>


Re: [Governance] [Proposal] Stop force-pushing to PRs after release?

2020-11-25 Thread Uwe L. Korn
Hello Jorge,

I know from the past that on the Python/C++ side we needed to do this for a lot 
of contributors to enable them to work with their branches/PRs again, as they 
were overwhelmed by the complexity of these rebases. Personally, I wouldn't like 
to spend much time on whether or not we should rebase PRs after the release for 
everyone, but rather get rid of the need to push to master to get a release 
candidate out in the first place. This makes the work of the release manager 
harder, confuses downstream packagers, and also means that all PRs that were 
touched during the release process end up diverged. 

The main headache here is that currently the release tooling on the Java side 
requires us to do this. I know that in the last few days someone opened a JIRA 
to get rid of that (and hopefully someone will link to that JIRA in this 
thread). Solving that would be a win for everyone and would also make this 
discussion unnecessary. The main caveat is that the annoyance on the Java side 
pops up mostly for the non-Java devs, and thus it has not been solved yet.

Cheers
Uwe

On Wed, Nov 25, 2020, at 5:26 AM, Jorge Cardoso Leitão wrote:
> Hi,
> 
> Based on a discussion on PR #8481, I would like to raise a concern around
> git and the post-actions of a release. The background is that I was really
> confused that someone has force-pushed to a PR that I fielded, re-writing
> its history and causing the PR to break.
> 
> @wes and @kszucs quickly explained to me that this is a normal practice in
> this project on every release, to which I was a bit astonished.
> 
> AFAIK, in open source, there is a strong expectation that PRs are managed
> by individual contributors, and committers of the project only request
> contributors to make changes to them, or kindly ask before pushing (not
> force-pushing) directly to the PR.
> 
> IMO, by force-pushing to PRs, we are inverting all expectations and
> sometimes even breaking PRs without consent from the contributor. This
> drives any reasonable contributor to be pissed off at the team for what we
> just did after a release:
> 
>- force-pushed to master
>- force-pushed to their PRs
>- broke their PRs's CI
>- no prior notice or request of any of the above
> 
> IMO this is confusing and goes against what virtually every open source
> project does. This process also puts a lot of strain in our CIs when we
> have an average of 100 open PRs and force-push to all of them at once.
> 
> As such, I would like to propose a small change in the post-release process
> and to the development practices more generally:
> 
>1. stop force-pushing to others' PRs
>2. stop pushing to others' PRs without their explicit consent
>3. document in the contributing guidelines
>
>that master is force-pushed on every release, and the steps that
>contributors need to take to bring their PRs to the latest master
> 
> The underlying principles here are:
> 
>- it is the contributor's responsibility to keep the PRs in a "ready to
>merge" state, rebasing them to master as master changes.
>- A force-push to master corresponds to a change in master
>- thus it is the contributor's responsibility to rebase their PRs
>against the new master.
> 
> I understand the argument that it is a burden for the contributors to keep
> PRs up-to-date. However, I do not think that this justifies breaking one of
> the most basic assumptions that a contributor has on an open source
> project. Furthermore, they already have to do it anyways whenever the
> master changes with breaking changes: the contributor's process is already
> "git fetch upstream && git rebase upstream/master" whenever master changes.
> Whether it changes due to a normal push or a force-push does not really
> affect this burden when compared to when a merge conflict emerges.
> 
> Any thoughts?
> Best,
> Jorge
>


Re: Development with C++ and Cython APIs in Arrow

2020-11-06 Thread Uwe L. Korn
The pip package (specifically the wheels) should contain the C++ libraries and 
headers. So it should be sufficient for your use case, and there shouldn't be 
any need to build the C++ artifacts separately.

On Fri, Nov 6, 2020, at 5:18 PM, Vibhatha Abeykoon wrote:
> One more question about packaging, here when the API requires both Cython
> and C++ APIs,
> Pyarrow dependency must also be built from the source? Or is it practical
> to use the same version
> of Arrow using Pip?
> 
> With Regards,
> Vibhatha Abeykoon
> 
> 
> On Fri, Nov 6, 2020 at 9:59 AM Vibhatha Abeykoon  wrote:
> 
> > Hello Uwe,
> >
> > Nice example. I will follow this.
> >
> > With Regards,
> > Vibhatha Abeykoon
> >
> >
> > On Fri, Nov 6, 2020 at 9:36 AM Uwe L. Korn  wrote:
> >
> >> Hello Vibhatha,
> >>
> >> the best is to set a relative RPATH on the libraries. An example for this
> >> can be seen in the turbodbc sources:
> >> https://github.com/blue-yonder/turbodbc/blob/80a29a7edfbdabf12410af01c0c0ae74bfc3aab4/setup.py#L186-L189
> >>
> >> Cheers
> >> Uwe
> >>
> >> On Tue, Nov 3, 2020, at 11:44 PM, Vibhatha Abeykoon wrote:
> >> > Hello,
> >> >
> >> > I have a question related to packaging an API written by using both C++
> >> API
> >> > and Cython API of Arrow.
> >> >
> >> > For now what I do is, build Arrow from source to generate both
> >> libarrow.so
> >> > and libarrow_python.so. When using the library, I have to point the
> >> > installed *.so using the LD_LIBRARY_PATH. But when packaging the
> >> project, I
> >> > am not quite sure whether this is the correct approach. For instance,
> >> when
> >> > generating a pip package, this workflow is not a good solution.
> >> >
> >> > Any comments and suggestions?
> >> >
> >> > With Regards,
> >> > Vibhatha Abeykoon,
> >> >
> >>
> >
>
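
As an illustration of building against the libraries and headers shipped in the 
wheel, a minimal setup.py sketch using pyarrow's helper functions (the module 
name and source file are placeholders):

from setuptools import setup, Extension

import numpy as np
import pyarrow as pa

ext = Extension(
    "my_arrow_ext",                    # placeholder module name
    sources=["my_arrow_ext.cpp"],      # placeholder source file
    include_dirs=[pa.get_include(), np.get_include()],
    libraries=pa.get_libraries(),      # typically ['arrow', 'arrow_python']
    library_dirs=pa.get_library_dirs(),
    language="c++",
)

setup(name="my-arrow-ext", ext_modules=[ext])

On Linux you may additionally need to call pyarrow.create_library_symlinks() 
once, since the wheel ships versioned shared libraries that the linker cannot 
find under their unversioned names otherwise.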


Re: Development with C++ and Cython APIs in Arrow

2020-11-06 Thread Uwe L. Korn
Hello Vibhatha,

the best is to set a relative RPATH on the libraries. An example for this can 
be seen in the turbodbc sources: 
https://github.com/blue-yonder/turbodbc/blob/80a29a7edfbdabf12410af01c0c0ae74bfc3aab4/setup.py#L186-L189

Cheers
Uwe

On Tue, Nov 3, 2020, at 11:44 PM, Vibhatha Abeykoon wrote:
> Hello,
> 
> I have a question related to packaging an API written by using both C++ API
> and Cython API of Arrow.
> 
> For now what I do is, build Arrow from source to generate both libarrow.so
> and libarrow_python.so. When using the library, I have to point the
> installed *.so using the LD_LIBRARY_PATH. But when packaging the project, I
> am not quite sure whether this is the correct approach. For instance, when
> generating a pip package, this workflow is not a good solution.
> 
> Any comments and suggestions?
> 
> With Regards,
> Vibhatha Abeykoon,
>
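
A minimal sketch of what such a relative RPATH can look like in a setup.py, in 
the spirit of the turbodbc example linked above ($ORIGIN on Linux and 
@loader_path on macOS refer to the directory of the extension itself; module 
and source names are placeholders):

import sys
from setuptools import Extension

# Resolve shared libraries relative to the extension module itself at runtime,
# so LD_LIBRARY_PATH does not need to be set by the user.
if sys.platform == "darwin":
    rpath_flags = ["-Wl,-rpath,@loader_path"]
else:
    rpath_flags = ["-Wl,-rpath,$ORIGIN"]

ext = Extension(
    "my_arrow_ext",                  # placeholder module name
    sources=["my_arrow_ext.cpp"],    # placeholder source file
    extra_link_args=rpath_flags,
    language="c++",
)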


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-11-05-0

2020-11-05 Thread Uwe L. Korn
Taking care of the failing conda-win jobs in 
https://issues.apache.org/jira/browse/ARROW-10502

On Thu, Nov 5, 2020, at 11:14 AM, Crossbow wrote:
> 
> Arrow Build Report for Job nightly-2020-11-05-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0
> 
> Failed Tasks:
> - conda-win-vs2017-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-win-vs2017-py36
> - conda-win-vs2017-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-win-vs2017-py37
> - conda-win-vs2017-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-win-vs2017-py38
> - test-conda-python-3.7-spark-branch-3.0:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-test-conda-python-3.7-spark-branch-3.0
> - test-conda-python-3.8-jpype:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-test-conda-python-3.8-jpype
> - test-ubuntu-18.04-docs:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-test-ubuntu-18.04-docs
> 
> Succeeded Tasks:
> - centos-6-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-centos-8-amd64
> - conda-clean:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-clean
> - conda-linux-gcc-py36-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-drone-conda-linux-gcc-py36-aarch64
> - conda-linux-gcc-py36-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-linux-gcc-py36-cpu
> - conda-linux-gcc-py36-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-linux-gcc-py36-cuda
> - conda-linux-gcc-py37-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-drone-conda-linux-gcc-py37-aarch64
> - conda-linux-gcc-py37-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-linux-gcc-py37-cpu
> - conda-linux-gcc-py37-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-linux-gcc-py37-cuda
> - conda-linux-gcc-py38-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-drone-conda-linux-gcc-py38-aarch64
> - conda-linux-gcc-py38-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-linux-gcc-py38-cpu
> - conda-linux-gcc-py38-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-linux-gcc-py38-cuda
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-azure-conda-osx-clang-py38
> - debian-buster-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-debian-buster-amd64
> - debian-buster-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-travis-debian-buster-arm64
> - debian-stretch-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-debian-stretch-amd64
> - debian-stretch-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-travis-debian-stretch-arm64
> - example-cpp-minimal-build-static-system-dependency:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-example-cpp-minimal-build-static-system-dependency
> - example-cpp-minimal-build-static:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-github-example-cpp-minimal-build-static
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-11-05-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL: 
> 

Re: [VOTE] Release Apache Arrow 2.0.0 - RC2

2020-10-21 Thread Uwe L. Korn
Status of post release tasks:
> > > >> > > >
> > > >> > > > 1.  [done] rebase master
> > > >> > > > 2.  [done] upload source
> > > >> > > > 3.  [done] upload binaries
> > > >> > > > 4.  [kszucs] update website
> > > >> > > > 5.  [done] upload ruby gems
> > > >> > > > 6.  [ ] upload js packages
> > > >> > > > 8.  [done] upload C# packages
> > > >> > > > 9.  [ ] upload rust crates
> > > >> > > > 10. [ ] update conda recipes
> > > >> > > > 11. [kszucs] upload wheels to pypi
> > > >> > > > 12. [ ] update homebrew packages
> > > >> > > > 13. [done] update maven artifacts
> > > >> > > > 14. [ ] update msys2
> > > >> > > > 15. [nealrichardson] update R packages
> > > >> > > > 16. [kszucs] update docs
> > > >> > > > 17. [kszucs] rebase the pull requests
> > > >> > > > 18. [done] set jira version as released
> > > >> > > > 19. [done] set release data at reporter.apache.org
> > > >> > > >
> > > >> > > > On Mon, Oct 19, 2020 at 7:47 PM Krisztián Szűcs
> > > >> > > >  wrote:
> > > >> > > > >
> > > >> > > > > The VOTE carries with 3 binding +1 votes and 1 non-binding +1
> > > vote
> > > >> > and
> > > >> > > > > 1 non-binding +0 vote.
> > > >> > > > >
> > > >> > > > > I'm starting the post release tasks and keep you posted about
> > > the
> > > >> > > > > remaining tasks.
> > > >> > > > >
> > > >> > > > > Thanks everyone!
> > > >> > > > >
> > > >> > > > > On Mon, Oct 19, 2020 at 7:45 PM Uwe L. Korn 
> > > >> > wrote:
> > > >> > > > > >
> > > >> > > > > > +0 from my side, I see no big issues. I was able to verify 
> > > >> > > > > > the
> > > >> > wheels,
> > > >> > > > the source verification fails due to the llvm package issues on
> > > brew;
> > > >> > thus
> > > >> > > > I'm not able to +1 this time.
> > > >> > > > > >
> > > >> > > > > > Uwe
> > > >> > > > > >
> > > >> > > > > > On Mon, Oct 19, 2020, at 7:38 PM, Krisztián Szűcs wrote:
> > > >> > > > > > > On Mon, Oct 19, 2020 at 5:32 PM Uwe L. Korn <
> > > uw...@xhochy.com>
> > > >> > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Mon, Oct 19, 2020, at 5:07 PM, Neal Richardson wrote:
> > > >> > > > > > > > > I wouldn't expect the default S3 region to depend on
> > > locale.
> > > >> > It
> > > >> > > > does
> > > >> > > > > > > > > depend
> > > >> > > > > > > > > on aws-sdk-cpp version, as we saw in ARROW-10066; see
> > > >> > > > > > > > >
> > > >> > > >
> > > >> >
> > > https://aws.amazon.com/blogs/developer/aws-sdk-for-c-version-1-8-developer-preview/
> > > >> > > > > > > > > for how the new version determines the default version.
> > > >> > > > > > > >
> > > >> > > > > > > > Already worked around that thanks to Antoine!
> > > >> > > > > > > >
> > > >> > > > > > > > > And a change in Homebrew policy isn't grounds for
> > > rejecting
> > > >> > this
> > > >> > > > release,
> > > >> > > > > > > > > is it?
> > > >> > > > > > > >
> > > >> > > > > > > > No but it means that I cannot test the source on OSX with
> > > the
> > > >> >

Re: [VOTE] Release Apache Arrow 2.0.0 - RC2

2020-10-19 Thread Uwe L. Korn
+0 from my side, I see no big issues. I was able to verify the wheels, but the 
source verification fails due to the llvm package issues on brew; thus I'm not 
able to +1 this time.

Uwe

On Mon, Oct 19, 2020, at 7:38 PM, Krisztián Szűcs wrote:
> On Mon, Oct 19, 2020 at 5:32 PM Uwe L. Korn  wrote:
> >
> >
> >
> > On Mon, Oct 19, 2020, at 5:07 PM, Neal Richardson wrote:
> > > I wouldn't expect the default S3 region to depend on locale. It does
> > > depend
> > > on aws-sdk-cpp version, as we saw in ARROW-10066; see
> > > https://aws.amazon.com/blogs/developer/aws-sdk-for-c-version-1-8-developer-preview/
> > > for how the new version determines the default version.
> >
> > Already worked around that thanks to Antoine!
> >
> > > And a change in Homebrew policy isn't grounds for rejecting this release,
> > > is it?
> >
> > No but it means that I cannot test the source on OSX with the given 
> > scripts. Someone needs to address this with Homebrew though and re-add the 
> > llvm@10 package. This isn't a change in policy and I guess that it may 
> > suffice to add the new Arrow release to homebrew to get llvm@10 re-added.
> We can address this during the post release tasks.
> >
> > Uwe
> >
> > >
> > > Neal
> > >
> > > On Mon, Oct 19, 2020 at 6:04 AM Uwe L. Korn  wrote:
> > >
> > > > Trying to verify on macOS but run into the following two issues:
> > > >
> > > > * The default S3 region is „eu-central-1“ for me despite setting LANG=C
> > > > * llvm@10 is not available for homebrew anymore, see also
> > > > https://github.com/Homebrew/homebrew-core/pull/62798#issuecomment-711606370
> > > > <
> > > > https://github.com/Homebrew/homebrew-core/pull/62798#issuecomment-711606370
> > > > >
> > > >
> > > > > Am 16.10.2020 um 15:16 schrieb Krisztián Szűcs <
> > > > szucs.kriszt...@gmail.com>:
> > > > >
> > > > > +1 (binding)
> > > > >
> > > > > Verified source, binary and wheel artifacts on macOS Catalina.
> > > > > Also executed the automatized crossbow verification tasks [1].
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/8479#issuecomment-709965167
> > > > >
> > > > > On Fri, Oct 16, 2020 at 4:30 AM Sutou Kouhei  
> > > > > wrote:
> > > > >>
> > > > >>>  * Python 3.8 wheel's test was failed. It also failed with
> > > > >>>1.0.0 and 1.0.1.
> > > > >>>
> > > > >>>https://gist.github.com/kou/62cae5dcf4dcdd8f044fd33c50e8a007
> > > > >>
> > > > >> Fix: https://github.com/apache/arrow/pull/8477
> > > > >>
> > > > >> In <20201015.060219.286366408805956805@clear-code.com>
> > > > >>  "Re: [VOTE] Release Apache Arrow 2.0.0 - RC2" on Thu, 15 Oct 2020
> > > > 06:02:19 +0900 (JST),
> > > > >>  Sutou Kouhei  wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> I forgot to mention some notes:
> > > > >>>
> > > > >>>  * JavaScript test was failed with system Node.js
> > > > >>>v12.18.4. (INSTALL_NODE=0
> > > > >>>dev/release/verify-release-candidate.sh source)
> > > > >>>
> > > > >>>It works without INSTALL_NODE=0. (Node.js is installed
> > > > >>>by nvm.)
> > > > >>>
> > > > >>>  * Python 3.8 wheel's test was failed. It also failed with
> > > > >>>1.0.0 and 1.0.1.
> > > > >>>
> > > > >>>https://gist.github.com/kou/62cae5dcf4dcdd8f044fd33c50e8a007
> > > > >>>
> > > > >>>
> > > > >>> Thanks,
> > > > >>> --
> > > > >>> kou
> > > > >>>
> > > > >>> In <20201014.122838.1320330358136283691@clear-code.com>
> > > > >>>  "Re: [VOTE] Release Apache Arrow 2.0.0 - RC2" on Wed, 14 Oct 2020
> > > > 12:28:38 +0900 (JST),
> > > > >>>  Sutou Kouhei  wrote:
> > > > >>>
> > > > >>>> +1 (binding)
> > > > >>>>
> > > > >>>> I ran the follow

Re: [VOTE] Release Apache Arrow 2.0.0 - RC2

2020-10-19 Thread Uwe L. Korn



On Mon, Oct 19, 2020, at 5:07 PM, Neal Richardson wrote:
> I wouldn't expect the default S3 region to depend on locale. It does 
> depend
> on aws-sdk-cpp version, as we saw in ARROW-10066; see
> https://aws.amazon.com/blogs/developer/aws-sdk-for-c-version-1-8-developer-preview/
> for how the new version determines the default version.

Already worked around that thanks to Antoine!

> And a change in Homebrew policy isn't grounds for rejecting this release,
> is it?

No but it means that I cannot test the source on OSX with the given scripts. 
Someone needs to address this with Homebrew though and re-add the llvm@10 
package. This isn't a change in policy and I guess that it may suffice to add 
the new Arrow release to homebrew to get llvm@10 re-added.

Uwe

> 
> Neal
> 
> On Mon, Oct 19, 2020 at 6:04 AM Uwe L. Korn  wrote:
> 
> > Trying to verify on macOS but run into the following two issues:
> >
> > * The default S3 region is „eu-central-1“ for me despite setting LANG=C
> > * llvm@10 is not available for homebrew anymore, see also
> > https://github.com/Homebrew/homebrew-core/pull/62798#issuecomment-711606370
> > <
> > https://github.com/Homebrew/homebrew-core/pull/62798#issuecomment-711606370
> > >
> >
> > > Am 16.10.2020 um 15:16 schrieb Krisztián Szűcs <
> > szucs.kriszt...@gmail.com>:
> > >
> > > +1 (binding)
> > >
> > > Verified source, binary and wheel artifacts on macOS Catalina.
> > > Also executed the automatized crossbow verification tasks [1].
> > >
> > > [1]: https://github.com/apache/arrow/pull/8479#issuecomment-709965167
> > >
> > > On Fri, Oct 16, 2020 at 4:30 AM Sutou Kouhei  wrote:
> > >>
> > >>>  * Python 3.8 wheel's test was failed. It also failed with
> > >>>1.0.0 and 1.0.1.
> > >>>
> > >>>https://gist.github.com/kou/62cae5dcf4dcdd8f044fd33c50e8a007
> > >>
> > >> Fix: https://github.com/apache/arrow/pull/8477
> > >>
> > >> In <20201015.060219.286366408805956805@clear-code.com>
> > >>  "Re: [VOTE] Release Apache Arrow 2.0.0 - RC2" on Thu, 15 Oct 2020
> > 06:02:19 +0900 (JST),
> > >>  Sutou Kouhei  wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I forgot to mention some notes:
> > >>>
> > >>>  * JavaScript test was failed with system Node.js
> > >>>v12.18.4. (INSTALL_NODE=0
> > >>>dev/release/verify-release-candidate.sh source)
> > >>>
> > >>>It works without INSTALL_NODE=0. (Node.js is installed
> > >>>by nvm.)
> > >>>
> > >>>  * Python 3.8 wheel's test was failed. It also failed with
> > >>>1.0.0 and 1.0.1.
> > >>>
> > >>>https://gist.github.com/kou/62cae5dcf4dcdd8f044fd33c50e8a007
> > >>>
> > >>>
> > >>> Thanks,
> > >>> --
> > >>> kou
> > >>>
> > >>> In <20201014.122838.1320330358136283691@clear-code.com>
> > >>>  "Re: [VOTE] Release Apache Arrow 2.0.0 - RC2" on Wed, 14 Oct 2020
> > 12:28:38 +0900 (JST),
> > >>>  Sutou Kouhei  wrote:
> > >>>
> > >>>> +1 (binding)
> > >>>>
> > >>>> I ran the followings on Debian GNU/Linux sid:
> > >>>>
> > >>>>  * TZ=UTC \
> > >>>>  ARROW_CMAKE_OPTIONS="-DgRPC_SOURCE=BUNDLED
> > -DBoost_NO_BOOST_CMAKE=ON" \
> > >>>>  CUDA_TOOLKIT_ROOT=/usr \
> > >>>>  dev/release/verify-release-candidate.sh source 2.0.0 2
> > >>>>  * dev/release/verify-release-candidate.sh binaries 2.0.0 2
> > >>>>  * LANG=C dev/release/verify-release-candidate.sh wheels 2.0.0 2
> > >>>>
> > >>>> with:
> > >>>>
> > >>>>  * gcc (Debian 10.2.0-9) 10.2.0
> > >>>>  * openjdk version "11.0.8" 2020-07-14
> > >>>>  * nvidia-cuda-dev 10.2.89-4
> > >>>>
> > >>>> Thanks,
> > >>>> --
> > >>>> kou
> > >>>>
> > >>>>
> > >>>> In <
> > cahm19a4rc5-jbc_0b7ffxca3jjplfokbwznfwct5_vuobew...@mail.gmail.com>
> > >>>>  "[VOTE] Release Apache Arrow 2.0.0 - RC2" on Tue, 13 Oct 2020
> > 17:41:00 +0200,
> > >&g

Re: [VOTE] Release Apache Arrow 2.0.0 - RC2

2020-10-19 Thread Uwe L. Korn
Trying to verify on macOS but running into the following two issues:

* The default S3 region is „eu-central-1“ for me despite setting LANG=C
* llvm@10 is not available for homebrew anymore, see also 
https://github.com/Homebrew/homebrew-core/pull/62798#issuecomment-711606370 


> On 16.10.2020 at 15:16, Krisztián Szűcs wrote:
> 
> +1 (binding)
> 
> Verified source, binary and wheel artifacts on macOS Catalina.
> Also executed the automatized crossbow verification tasks [1].
> 
> [1]: https://github.com/apache/arrow/pull/8479#issuecomment-709965167
> 
> On Fri, Oct 16, 2020 at 4:30 AM Sutou Kouhei  wrote:
>> 
>>>  * Python 3.8 wheel's test was failed. It also failed with
>>>1.0.0 and 1.0.1.
>>> 
>>>https://gist.github.com/kou/62cae5dcf4dcdd8f044fd33c50e8a007
>> 
>> Fix: https://github.com/apache/arrow/pull/8477
>> 
>> In <20201015.060219.286366408805956805@clear-code.com>
>>  "Re: [VOTE] Release Apache Arrow 2.0.0 - RC2" on Thu, 15 Oct 2020 06:02:19 
>> +0900 (JST),
>>  Sutou Kouhei  wrote:
>> 
>>> Hi,
>>> 
>>> I forgot to mention some notes:
>>> 
>>>  * JavaScript test was failed with system Node.js
>>>v12.18.4. (INSTALL_NODE=0
>>>dev/release/verify-release-candidate.sh source)
>>> 
>>>It works without INSTALL_NODE=0. (Node.js is installed
>>>by nvm.)
>>> 
>>>  * Python 3.8 wheel's test was failed. It also failed with
>>>1.0.0 and 1.0.1.
>>> 
>>>https://gist.github.com/kou/62cae5dcf4dcdd8f044fd33c50e8a007
>>> 
>>> 
>>> Thanks,
>>> --
>>> kou
>>> 
>>> In <20201014.122838.1320330358136283691@clear-code.com>
>>>  "Re: [VOTE] Release Apache Arrow 2.0.0 - RC2" on Wed, 14 Oct 2020 12:28:38 
>>> +0900 (JST),
>>>  Sutou Kouhei  wrote:
>>> 
 +1 (binding)
 
 I ran the followings on Debian GNU/Linux sid:
 
  * TZ=UTC \
  ARROW_CMAKE_OPTIONS="-DgRPC_SOURCE=BUNDLED -DBoost_NO_BOOST_CMAKE=ON" 
 \
  CUDA_TOOLKIT_ROOT=/usr \
  dev/release/verify-release-candidate.sh source 2.0.0 2
  * dev/release/verify-release-candidate.sh binaries 2.0.0 2
  * LANG=C dev/release/verify-release-candidate.sh wheels 2.0.0 2
 
 with:
 
  * gcc (Debian 10.2.0-9) 10.2.0
  * openjdk version "11.0.8" 2020-07-14
  * nvidia-cuda-dev 10.2.89-4
 
 Thanks,
 --
 kou
 
 
 In 
  "[VOTE] Release Apache Arrow 2.0.0 - RC2" on Tue, 13 Oct 2020 17:41:00 
 +0200,
  Krisztián Szűcs  wrote:
 
> Hi,
> 
> I would like to propose the following release candidate (RC2*) of Apache
> Arrow version 2.0.0. This is a release consisting of 561
> resolved JIRA issues[1].
> 
> This release candidate is based on commit:
> 478286658055bb91737394c2065b92a7e92fb0c1 [2]
> 
> The source release rc2 is hosted at [3].
> The binary artifacts are hosted at [4][5][6][7].
> The changelog is located at [8].
> 
> Please download, verify checksums and signatures, run the unit tests,
> and vote on the release. See [9] for how to validate a release candidate.
> 
> The vote will be open for at least 72 hours.
> 
> [ ] +1 Release this as Apache Arrow 2.0.0
> [ ] +0
> [ ] -1 Do not release this as Apache Arrow 2.0.0 because...
> 
> * RC1 had a CMake issue which surfaced during the packaging builds so
> I had to create another release candidate.
> [1]: 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%202.0.0
> [2]: 
> https://github.com/apache/arrow/tree/478286658055bb91737394c2065b92a7e92fb0c1
> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-2.0.0-rc2
> [4]: https://bintray.com/apache/arrow/centos-rc/2.0.0-rc2
> [5]: https://bintray.com/apache/arrow/debian-rc/2.0.0-rc2
> [6]: https://bintray.com/apache/arrow/python-rc/2.0.0-rc2
> [7]: https://bintray.com/apache/arrow/ubuntu-rc/2.0.0-rc2
> [8]: 
> https://github.com/apache/arrow/blob/478286658055bb91737394c2065b92a7e92fb0c1/CHANGELOG.md
> [9]: 
> https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates



Re: [C++] Arrow to ORC type conversion

2020-10-18 Thread Uwe L. Korn
This sounds reasonable from an Arrow perspective; you might want to CC the ORC 
list as well, or ask someone there to co-review your work on the adapter.

Uwe

> On 18.10.2020 at 17:24, Ying Zhou wrote:
> 
> Hi,
> 
> I’m developing the adapter that converts Arrow Arrays, ChunkedArrays, 
> RecordBatches and Tables into ORC files. Given the ORC Specification and 
> Arrow Columnar Format. 
> 
> Here is my current type mapping:
> 
> Type::type::NA -> nulllptr
> Type::type::BOOL -> liborc::TypeKind::BOOLEAN
> Type::type::UINT8 -> liborc::TypeKind::BYTE
> Type::type::INT8 -> liborc::TypeKind::BYTE
> Type::type::UINT16 -> liborc::TypeKind::SHORT
> Type::type::INT16 -> liborc::TypeKind::SHORT
> Type::type::UINT32 -> liborc::TypeKind::INT
> Type::type::INT32 -> liborc::TypeKind::INT
> Type::type::INTERVAL_MONTH -> liborc::TypeKind:INT
> Type::type::UINT64 -> liborc::TypeKind::LONG
> Type::type::INT64 -> liborc::TypeKind::LONG
> Type::type::INTERVAL_DAY_TIME -> liborc::TypeKind:LONG
> Type::type::DURATION -> liborc::TypeKind::LONG
> Type::type::HALF_FLOAT -> liborc::TypeKind::FLOAT
> Type::type::FLOAT -> liborc::TypeKind::FLOAT
> Type::type::DOUBLE -> liborc::TypeKind::DOUBLE
> Type::type::STRING -> liborc::TypeKind::STRING
> Type::type::LARGE_STRING -> liborc::TypeKind::STRING
> Type::type::FIXED_SIZE_BINARY -> liborc::TypeKind::CHAR
> Type::type::BINARY -> liborc::TypeKind::BINARY
> Type::type::LARGE_BINARY -> liborc::TypeKind::BINARY
> Type::type::DATE32 -> liborc::TypeKind::DATE
> Type::type::TIMESTAMP -> liborc::TypeKind::TIMESTAMP
> Type::type::TIME32 -> liborc::TypeKind::TIMESTAMP
> Type::type::TIME64 -> liborc::TypeKind::TIMESTAMP
> Type::type::DATE64 -> liborc::TypeKind::TIMESTAMP
> Type::type::DECIMAL -> liborc::TypeKind::DECIMAL
> Type::type::LIST -> liborc::TypeKind::LIST
> Type::type::FIXED_SIZE_LIST -> liborc::TypeKind::LIST
> Type::type::LARGE_LIST -> liborc::TypeKind::LIST
> Type::type::STRUCT -> liborc::TypeKind::STRUCT
> Type::type::MAP -> liborc::TypeKind::MAP
> Type::type::DENSE_UNION -> liborc::TypeKind::UNION
> Type::type::SPARSE_UNION -> liborc::TypeKind::UNION
> Type::type::DICTIONARY -> the ORC version of its value type
> 
> There are some concerns particularly related to duration types which don’t 
> exist for Apache ORC which I have to convert to integers. Is my current 
> mapping reasonable? Thanks!
> 
> Best,
> Ying Zhou



Re: [VOTE] Accept donation of Julia implementation for Apache Arrow

2020-10-14 Thread Uwe L. Korn
+1 (binding)

On Wed, Oct 14, 2020, at 3:58 PM, Andy Grove wrote:
> +1 (binding)
> 
> On Tue, Oct 13, 2020 at 8:26 PM Fan Liya  wrote:
> 
> > +1 (non-binding)
> >
> > Best,
> > Liya Fan
> >
> >
> > On Wed, Oct 14, 2020 at 9:02 AM Sutou Kouhei  wrote:
> >
> > > +1 (binding)
> > >
> > > In 
> > >   "[VOTE] Accept donation of Julia implementation for Apache Arrow" on
> > > Mon, 12 Oct 2020 13:35:14 -0700,
> > >   Neal Richardson  wrote:
> > >
> > > > Hi all,
> > > > Last month [1] Jacob Quinn proposed donating Arrow.jl, a Julia
> > > > implementation of Arrow, to the Apache Arrow project. The community has
> > > had
> > > > an opportunity to discuss this and there do not seem to be objections.
> > > > There is a pull request now available:
> > > >
> > > > https://github.com/apache/arrow/pull/8448
> > > >
> > > > This vote is to determine if the Arrow PMC is in favor of accepting
> > this
> > > > donation. If the vote passes, the PMC and the authors of the code will
> > > work
> > > > together to complete the ASF IP Clearance process
> > > > (https://incubator.apache.org/ip-clearance/) and import this Julia
> > > > implementation into Apache Arrow.
> > > >
> > > > [ ] +1 : Accept contribution of Arrow.jl
> > > > [ ]  0 : No opinion
> > > > [ ] -1 : Reject contribution because...
> > > >
> > > > Here is my vote: +1
> > > >
> > > > The vote will be open for at least 72 hours.
> > > >
> > > > Thanks,
> > > > Neal
> > > >
> > > > [1]:
> > > >
> > >
> > https://lists.apache.org/thread.html/r5f6d0525b3e83de0f7faa2f91a844f1b40c78da4da25a8c0242f5624%40%3Cdev.arrow.apache.org%3E
> > >
> >
>


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-10-02-0

2020-10-02 Thread Uwe L. Korn
conda-*-aarch64 hit the 1h time limit on drone.io, probably not easy to fix.

On Fri, Oct 2, 2020, at 12:23 PM, Crossbow wrote:
> 
> Arrow Build Report for Job nightly-2020-10-02-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0
> 
> Failed Tasks:
> - conda-linux-gcc-py36-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-drone-conda-linux-gcc-py36-aarch64
> - conda-linux-gcc-py38-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-drone-conda-linux-gcc-py38-aarch64
> - debian-buster-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-travis-debian-buster-arm64
> - gandiva-jar-xenial:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-travis-gandiva-jar-xenial
> - homebrew-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-travis-homebrew-cpp
> - test-conda-cpp-valgrind:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-test-conda-cpp-valgrind
> - test-conda-python-3.7-hdfs-2.9.2:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-test-conda-python-3.7-hdfs-2.9.2
> - test-conda-python-3.7-spark-branch-3.0:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-test-conda-python-3.7-spark-branch-3.0
> - test-conda-python-3.8-spark-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-test-conda-python-3.8-spark-master
> - test-r-linux-as-cran:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-test-r-linux-as-cran
> - test-ubuntu-20.04-cpp-14:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-test-ubuntu-20.04-cpp-14
> 
> Succeeded Tasks:
> - centos-6-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-centos-8-amd64
> - conda-clean:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-clean
> - conda-linux-gcc-py36-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-linux-gcc-py36-cpu
> - conda-linux-gcc-py36-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-linux-gcc-py36-cuda
> - conda-linux-gcc-py37-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-drone-conda-linux-gcc-py37-aarch64
> - conda-linux-gcc-py37-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-linux-gcc-py37-cpu
> - conda-linux-gcc-py37-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-linux-gcc-py37-cuda
> - conda-linux-gcc-py38-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-linux-gcc-py38-cpu
> - conda-linux-gcc-py38-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-linux-gcc-py38-cuda
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-osx-clang-py38
> - conda-win-vs2017-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-win-vs2017-py36
> - conda-win-vs2017-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-win-vs2017-py37
> - conda-win-vs2017-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-azure-conda-win-vs2017-py38
> - debian-buster-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-10-02-0-github-debian-buster-amd64
> - debian-stretch-amd64:
>   URL: 
> 

Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-09-27-0

2020-09-27 Thread Uwe L. Korn
I'm working on a fix for the conda failures in 
https://github.com/apache/arrow/pull/8282

On Sun, Sep 27, 2020, at 12:20 PM, Crossbow wrote:
> 
> Arrow Build Report for Job nightly-2020-09-27-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0
> 
> Failed Tasks:
> - conda-linux-gcc-py36-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py36-aarch64
> - conda-linux-gcc-py36-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py36-cpu
> - conda-linux-gcc-py36-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py36-cuda
> - conda-linux-gcc-py37-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py37-aarch64
> - conda-linux-gcc-py37-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py37-cpu
> - conda-linux-gcc-py37-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py37-cuda
> - conda-linux-gcc-py38-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-drone-conda-linux-gcc-py38-aarch64
> - conda-linux-gcc-py38-cpu:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py38-cpu
> - conda-linux-gcc-py38-cuda:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-linux-gcc-py38-cuda
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-osx-clang-py38
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-gandiva-jar-xenial
> - homebrew-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-homebrew-cpp
> - test-conda-cpp-valgrind:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-cpp-valgrind
> - test-conda-python-3.7-hdfs-2.9.2:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.7-hdfs-2.9.2
> - test-conda-python-3.7-spark-branch-3.0:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.7-spark-branch-3.0
> - test-conda-python-3.8-spark-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-test-conda-python-3.8-spark-master
> - wheel-osx-cp35m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp35m
> - wheel-osx-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp36m
> - wheel-osx-cp37m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp37m
> - wheel-osx-cp38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-wheel-osx-cp38
> - wheel-win-cp36m:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-appveyor-wheel-win-cp36m
> 
> Succeeded Tasks:
> - centos-6-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-github-centos-8-amd64
> - conda-clean:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-clean
> - conda-win-vs2017-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py36
> - conda-win-vs2017-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-09-27-0-azure-conda-win-vs2017-py37
> - 

Re: Closing Plasma issues?

2020-09-07 Thread Uwe L. Korn
If we do that, we should be clear about it and remove the code. Shipping 
Plasma as part of the release while not maintaining it like the other parts of 
the Arrow libraries seems inconsistent and will just be an annoyance to users 
who find a partly unusable component.

Cheers
Uwe

On Mon, Sep 7, 2020, at 7:58 PM, Robert Nishihara wrote:
> I think that makes sense. They can be reopened if necessary.
> 
> On Mon, Sep 7, 2020 at 9:49 AM Antoine Pitrou  wrote:
> 
> >
> > Hello,
> >
> > The Plasma component in our C++ codebase is now unmaintained, with the
> > original authors and maintainers having forked the codebase on their
> > side.  I propose to close the open Plasma issues in JIRA as "Won't fix".
> >  Is there any concern about this?
> >
> > Regards
> >
> > Antoine.
> >
>


Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-24 Thread Uwe L. Korn
1.  [done] rebase master
2.  [done] upload source
3.  [kszucs] upload binaries
4.  [ ] update website
5.  [ ] upload ruby gems
6.  [ ] upload js packages
8.  [ ] upload C# packages
9.  [andygrove] upload rust crates
10. [uwe] update conda recipes
11. [kszucs] upload wheels to pypi
12. [ ] update homebrew packages
13. [kszucs] update maven artifacts
14. [ ] update msys2
15. [ ] update R packages
16. [ ] update docs

On Fri, Jul 24, 2020, at 3:03 PM, Krisztián Szűcs wrote:
> Thanks Andy! Updated the checklist:
> 
> 1.  [done] rebase master
> 2.  [done] upload source
> 3.  [kszucs] upload binaries
> 4.  [ ] update website
> 5.  [ ] upload ruby gems
> 6.  [ ] upload js packages
> 8.  [ ] upload C# packages
> 9.  [andygrove] upload rust crates
> 10. [ ] update conda recipes
> 11. [kszucs] upload wheels to pypi
> 12. [ ] update homebrew packages
> 13. [kszucs] update maven artifacts
> 14. [ ] update msys2
> 15. [ ] update R packages
> 16. [ ] update docs
> 
> On Fri, Jul 24, 2020 at 2:26 PM Andy Grove  wrote:
> >
> > I can take care  of the Rust release.
> >
> > On Fri, Jul 24, 2020, 6:24 AM Krisztián Szűcs 
> > wrote:
> >
> > > Here is the post-release checklist, I'm working on the first three
> > > tasks. If anyone would like to help, please let me know.
> > >
> > > 1.  [kszucs] rebase master
> > > 2.  [kszucs] upload source
> > > 3.  [kszucs] upload binaries
> > > 4.  [ ] update website
> > > 5.  [ ] upload ruby gems
> > > 6.  [ ] upload js packages
> > > 8.  [ ] upload C# packages
> > > 9.  [ ] upload rust crates
> > > 10. [ ] update conda recipes
> > > 11. [ ] upload wheels to pypi
> > > 12. [ ] update homebrew packages
> > > 13. [ ] update maven artifacts
> > > 14. [ ] update msys2
> > > 15. [ ] update R packages
> > > 16. [ ] update docs
> > >
> > > On Fri, Jul 24, 2020 at 2:19 PM Krisztián Szűcs
> > >  wrote:
> > > >
> > > > The VOTE carries with 6 binding +1 votes and 1 non-binding +1 vote and
> > > > 1 non-binding +0 vote.
> > > >
> > > > I'm starting the post release tasks and keep you posted about the
> > > > remaining tasks.
> > > >
> > > > Thanks everyone!
> > > >
> > > > On Fri, Jul 24, 2020 at 2:16 PM Krisztián Szűcs
> > > >  wrote:
> > > > >
> > > > > On Tue, Jul 21, 2020 at 3:57 PM Ryan Murray  wrote:
> > > > > >
> > > > > > +0 (non-binding)
> > > > > >
> > > > > >
> > > > > > I verified source, release, binaries, integration tests for Python,
> > > C++,
> > > > > > Java. All went fine except for a failed test in c++ Gandiva: [
> > > FAILED  ]
> > > > > > TestProjector.TestDateTime
> > > > >
> > > > > It's hard to evaluate without any context.
> > > > >
> > > > > I executed specifically this test case locally and it has passed.
> > > > > According to the rest of the votes I don't consider it as a blocker.
> > > > >
> > > > > >
> > > > > >
> > > > > > Not sure if this is known or expected?
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 21, 2020 at 1:32 PM Andy Grove 
> > > wrote:
> > > > > >
> > > > > > > +1 (binding) on testing the Rust implementation only.
> > > > > > >
> > > > > > > I did notice that the release script is not updating all the
> > > versions
> > > > > > > correctly and I filed a JIRA [1].
> > > > > > >
> > > > > > > This shouldn't prevent the release though since this one version
> > > number can
> > > > > > > be updated manually when we publish the crates.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-9537
> > > > > > >
> > > > > > > On Mon, Jul 20, 2020 at 8:08 PM Krisztián Szűcs <
> > > szucs.kriszt...@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I would like to propose the following release candidate (RC2) of
> > > Apache
> > > > > > > > Arrow version 1.0.0. This is a release consisting of 838
> > > > > > > > resolved JIRA issues[1].
> > > > > > > >
> > > > > > > > This release candidate is based on commit:
> > > > > > > > b0d623957db820de4f1ff0a5ebd3e888194a48f0 [2]
> > > > > > > >
> > > > > > > > The source release rc2 is hosted at [3].
> > > > > > > > The binary artifacts are hosted at [4][5][6][7].
> > > > > > > > The changelog is located at [8].
> > > > > > > >
> > > > > > > > Please download, verify checksums and signatures, run the unit
> > > tests,
> > > > > > > > and vote on the release. See [9] for how to validate a release
> > > candidate.
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > >
> > > > > > > > [ ] +1 Release this as Apache Arrow 1.0.0
> > > > > > > > [ ] +0
> > > > > > > > [ ] -1 Do not release this as Apache Arrow 1.0.0 because...
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > >
> > > > > > >
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%201.0.0
> > > > > > > > [2]:
> > > > > > > >
> > > > > > >
> > > https://github.com/apache/arrow/tree/b0d623957db820de4f1ff0a5ebd3e888194a48f0
> > > > > > > > [3]:
> > > 

Re: Introducing Cylon

2020-07-22 Thread Uwe L. Korn
Hello Niranda,

cool to see this. Feel free to open a PR to add it to the Powered By list on 
https://arrow.apache.org/powered_by/

Cheers
Uwe

On Tue, Jul 21, 2020, at 8:03 PM, Niranda Perera wrote:
> Hi all,
> 
> We would like to introduce Cylon to the Arrow community. It is an
> open-source, lean distributed data processing library using the Arrow data
> format underneath. It is developed in C++ with bindings to Java, and
> Python. It has an in-memory Table API that integrates with PyArrow Table
> API. Cylon enables distributed data operations (ex: join (all variants),
> union, intersection, difference, etc). It can be imported as a library to
> existing applications or operate as a standalone framework. At the moment
> it is using OpenMPI to distribute and communicate. It is released with
> Apache License.
> 
> We are developing a distributed data-frame API on top of Cylon table API.
> It would be similar to the Dask/ Modin data-frame. Our initial experiments
> show promising performance. Cylon language bindings are also very
> lightweight. We just had the very first release of Cylon. We would like to
> hear from the Arrow community... Any comments, ideas are most appreciated!
> 
> Web visit - https://cylondata.org/  
> Github - https://github.com/cylondata/cylon
> Paper - https://arxiv.org/abs/2007.09589
> 
> Best
> -- 
> Niranda Perera
> @n1r44 
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
>


Re: [DRAFT] Arrow Board Report July 2020

2020-07-08 Thread Uwe L. Korn
Happy with the current version. I think this gives enough input for the board. 
We have so many things happening that are much better presented as part of the 
1.0 release process.

On Wed, Jul 8, 2020, at 12:52 AM, Micah Kornfield wrote:
> Worth mentioning the website work?
> 
> On Tue, Jul 7, 2020 at 3:47 PM Neal Richardson 
> wrote:
> 
> > Looks good to me.
> >
> > On Tue, Jul 7, 2020 at 3:45 PM Wes McKinney  wrote:
> >
> > > Please let me know comments or additions to the below. This is due to
> > > the ASF board tomorrow.
> > >
> > > ---
> > >
> > > ## Description:
> > >
> > > The mission of Apache Arrow is the creation and maintenance of software
> > > related
> > > to columnar in-memory processing and data interchange. The project has
> > some
> > > level of support for 11 different programming languages.
> > >
> > > ## Issues:
> > >
> > > - There are no issues requiring board attention at this time.
> > >
> > > ## Membership Data:
> > > Apache Arrow was founded 2016-01-19 (4 years ago)
> > > There are currently 52 committers and 30 PMC members in this project.
> > > The Committer-to-PMC ratio is roughly 7:4.
> > >
> > > Community changes, past quarter:
> > > - No new PMC members. Last addition was Francois Saint-Jacques on
> > > 2020-03-04.
> > > - Liya Fan was added as committer on 2020-06-09
> > > - Ji Liu was added as committer on 2020-06-09
> > >
> > > ## Project Activity:
> > >
> > > - We made the 0.17.0 and 0.17.1 releases since the last board report. The
> > >   community is readying a 1.0.0 release which will formally mark
> > stability
> > > in
> > >   the Arrow columnar format binary protocol and a move to semantic
> > > versioning
> > >   of the Arrow libraries.
> > > - We moved new JIRA issue notifications off of the dev@ mailing list to
> > > issues@
> > >   and created a new jira@ mailing list to capture the full JIRA
> > firehose.
> > > The
> > >   reasoning is that this could help encourage more participating in
> > mailing
> > >   list discussions.
> > > - We made changes in the codebase to remove uses of potentially
> > non-neutral
> > >   language that has been changed in many other open source projects.
> > >
> > > ## Community Health:
> > >
> > > The project and contributor base continues to grow in size and
> > > scope. We now have over 500 unique contributors since the
> > > creation of the project.
> > >
> >
>


Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
I did try the approach of not linking against pyarrow but leaving the symbols 
unresolved and just ensuring that pyarrow is imported before the vaex 
extension. This works out-of-the-box on macOS but fails on Linux, as symbols 
have a scope there. Adding the following lines to load Arrow into the global 
scope made it work, though:

import ctypes
libarrow = ctypes.CDLL('libarrow.so', ctypes.RTLD_GLOBAL)
libarrow_python = ctypes.CDLL('libarrow_python.so', ctypes.RTLD_GLOBAL)

On Thu, Jul 2, 2020, at 4:32 PM, Uwe L. Korn wrote:
> I had so much fun with the wheels in the past, I'm now a happy member 
> of conda-forge core instead :D
> 
> The good thing first:
> 
> * The C++ ABI didn't change between the manylinux versions, it is the 
> old one in all cases, so you can mix & match manylinux versions.
> 
> The sad things:
> 
> * The manylinuxX standards are intended to provide a way to ship 
> *self-contained* wheels that run on any recent Linux. The important 
> part here is that they need to be self-contained. Having a binary 
> dependency on another wheel is actually not allowed.
> * Thus the snowflake-python-connector ships the libarrow.so it was 
> built with as part of its wheel. In this case auditwheel is happy with 
> the wheel.
> * It is working with numpy as a dependency because NumPy linkage is 
> similar to the import lib behaviour on Windows: You don't actually link 
> against numpy but you statically link a set of functions that are 
> resolved to NumPy's function when you import numpy. Quick googling 
> leads to https://github.com/yugr/Implib.so which could provide 
> something similar for Linux.
> * You could actually omit linking to libarrow and try to populate the 
> symbols before you load the library. This is how the Python symbols are 
> available to extensions without linking to libpython.
> 
> 
> On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
> > Ok, thanks!
> > 
> > I'm setting up a repo with an example here, using pybind11:
> > https://github.com/vaexio/vaex-arrow-ext
> > 
> > and I'll just try all possible combinations and report back.
> > 
> > cheers,
> > 
> > Maarten Breddels
> > Software engineer / consultant / data scientist
> > Python / C++ / Javascript / Jupyter
> > www.maartenbreddels.com / vaex.io
> > maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> > [image: Twitter] <https://twitter.com/maartenbreddels>[image: Github]
> > <https://github.com/maartenbreddels>[image: LinkedIn]
> > <https://linkedin.com/in/maartenbreddels>[image: Skype]
> > 
> > 
> > 
> > 
> > Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
> > jorisvandenboss...@gmail.com>:
> > 
> > > Also no concrete answer, but one such example is turbodbc, I think.
> > > But it seems they only have conda binary packages, and don't
> > > distribute wheels ..
> > > (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> > > so not that relevant as comparison (they also need to build against an
> > > odbc driver in addition to arrow).
> > > But maybe Uwe has some more experience in this regard (and with
> > > attempts building wheels for turbodbc, eg
> > > https://github.com/blue-yonder/turbodbc/pull/108).
> > >
> > > Joris
> > >
> > > On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
> > > >
> > > >
> > > > Hi Maarten,
> > > >
> > > > Le 02/07/2020 à 10:53, Maarten Breddels a écrit :
> > > > >
> > > > > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> > > vaex
> > > > > extension distributed as a 2010 wheel, and build with the pyarrow 2010
> > > > > wheel, work in an environment where someone installed a pyarrow 2014
> > > > > wheel, or build from source, or installed from conda-forge?
> > > >
> > > > I have no idea about the concrete answer, but it probably depends
> > > > whether the libstdc++ ABI changed between those two versions.  I'm
> > > > afraid you'll have to experiment yourself.
> > > >
> > > > (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> > > > https://arrow.apache.org/docs/format/CDataInterface.html
> > > > though of course you won't have access to all the useful helpers in the
> > > > Arrow C++ library)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > >
> >
>


Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
Hello Tim,

thanks for the hint. I see that you build arrow yourselves in the Dockerfile. 
Could it be that in the end you statically link the arrow libraries?

As there are no wheels on PyPI, I couldn't verify whether that assumption is 
true.

Best
Uwe

On Thu, Jul 2, 2020, at 4:53 PM, Tim Paine wrote:
> We spent a ton of time on this for perspective, the end result is a 
> mostly compatible set of wheels for most platforms, I believe we 
> skipped py2 but nobody cares about those anyway. We link against 
> libarrow and libarrow_python on Linux, on windows we vendor them all 
> into our library. Feel free to scrape the perspective repo's cmake 
> lists and setup.py for details.
> 
> Tim Paine
> tim.paine.nyc
> 
> > On Jul 2, 2020, at 10:32, Uwe L. Korn  wrote:
> > 
> > I had so much fun with the wheels in the past, I'm now a happy member of 
> > conda-forge core instead :D
> > 
> > The good thing first:
> > 
> > * The C++ ABI didn't change between the manylinux versions, it is the old 
> > one in all cases, so you can mix & match manylinux versions.
> > 
> > The sad things:
> > 
> > * The manylinuxX standards are intended to provide a way to ship 
> > *self-contained* wheels that run on any recent Linux. The important part 
> > here is that they need to be self-contained. Having a binary dependency on 
> > another wheel is actually not allowed.
> > * Thus the snowflake-python-connector ships the libarrow.so it was built 
> > with as part of its wheel. In this case auditwheel is happy with the wheel.
> > * It is working with numpy as a dependency because NumPy linkage is similar 
> > to the import lib behaviour on Windows: You don't actually link against 
> > numpy but you statically link a set of functions that are resolved to 
> > NumPy's function when you import numpy. Quick googling leads to 
> > https://github.com/yugr/Implib.so which could provide something similar for 
> > Linux.
> > * You could actually omit linking to libarrow and try to populate the 
> > symbols before you load the library. This is how the Python symbols are 
> > available to extensions without linking to libpython.
> > 
> > 
> >> On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
> >> Ok, thanks!
> >> 
> >> I'm setting up a repo with an example here, using pybind11:
> >> https://github.com/vaexio/vaex-arrow-ext
> >> 
> >> and I'll just try all possible combinations and report back.
> >> 
> >> cheers,
> >> 
> >> Maarten Breddels
> >> Software engineer / consultant / data scientist
> >> Python / C++ / Javascript / Jupyter
> >> www.maartenbreddels.com / vaex.io
> >> maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> >> [image: Twitter] <https://twitter.com/maartenbreddels>[image: Github]
> >> <https://github.com/maartenbreddels>[image: LinkedIn]
> >> <https://linkedin.com/in/maartenbreddels>[image: Skype]
> >> 
> >> 
> >> 
> >> 
> >> Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
> >> jorisvandenboss...@gmail.com>:
> >> 
> >>> Also no concrete answer, but one such example is turbodbc, I think.
> >>> But it seems they only have conda binary packages, and don't
> >>> distribute wheels ..
> >>> (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> >>> so not that relevant as comparison (they also need to build against an
> >>> odbc driver in addition to arrow).
> >>> But maybe Uwe has some more experience in this regard (and with
> >>> attempts building wheels for turbodbc, eg
> >>> https://github.com/blue-yonder/turbodbc/pull/108).
> >>> 
> >>> Joris
> >>> 
> >>> On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
> >>>> 
> >>>> 
> >>>> Hi Maarten,
> >>>> 
> >>>> Le 02/07/2020 à 10:53, Maarten Breddels a écrit :
> >>>>> 
> >>>>> Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> >>> vaex
> >>>>> extension distributed as a 2010 wheel, and build with the pyarrow 2010
> >>>>> wheel, work in an environment where someone installed a pyarrow 2014
> >>>>> wheel, or build from source, or installed from conda-forge?
> >>>> 
> >>>> I have no idea about the concrete answer, but it probably depends
> >>>> whether the libstdc++ ABI changed between those two versions.  I'm
> >>>> afraid you'll have to experiment yourself.
> >>>> 
> >>>> (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> >>>> https://arrow.apache.org/docs/format/CDataInterface.html
> >>>> though of course you won't have access to all the useful helpers in the
> >>>> Arrow C++ library)
> >>>> 
> >>>> Regards
> >>>> 
> >>>> Antoine.
> >>>> 
> >>>> 
> >>> 
> >> 
>


Re: Developing a C++ Python extension

2020-07-02 Thread Uwe L. Korn
I had so much fun with the wheels in the past, I'm now a happy member of 
conda-forge core instead :D

The good thing first:

* The C++ ABI didn't change between the manylinux versions, it is the old one 
in all cases, so you can mix & match manylinux versions.

The sad things:

* The manylinuxX standards are intended to provide a way to ship 
*self-contained* wheels that run on any recent Linux. The important part here 
is that they need to be self-contained. Having a binary dependency on another 
wheel is actually not allowed.
* Thus the snowflake-python-connector ships the libarrow.so it was built with 
as part of its wheel. In this case auditwheel is happy with the wheel.
* It is working with numpy as a dependency because NumPy linkage is similar to 
the import lib behaviour on Windows: You don't actually link against numpy but 
you statically link a set of functions that are resolved to NumPy's function 
when you import numpy. Quick googling leads to 
https://github.com/yugr/Implib.so which could provide something similar for 
Linux.
* You could actually omit linking to libarrow and try to populate the symbols 
before you load the library. This is how the Python symbols are available to 
extensions without linking to libpython.
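
The last bullet above can be illustrated with a small dlopen-based sketch: load
libarrow into the global symbol scope first, and a later-loaded extension then
resolves its Arrow symbols against it without ever having been linked to the
library. The extension file name below is hypothetical:

#include <dlfcn.h>
#include <cstdio>

int main() {
  // RTLD_GLOBAL makes libarrow's symbols visible to libraries loaded later,
  // which is what the ctypes.CDLL(..., RTLD_GLOBAL) trick does from Python.
  void* arrow = dlopen("libarrow.so", RTLD_NOW | RTLD_GLOBAL);
  if (arrow == nullptr) {
    std::fprintf(stderr, "failed to load libarrow: %s\n", dlerror());
    return 1;
  }
  // Hypothetical extension module that uses Arrow symbols but was built
  // without linking against libarrow.
  void* ext = dlopen("vaex_arrow_ext.so", RTLD_NOW);
  if (ext == nullptr) {
    std::fprintf(stderr, "failed to load extension: %s\n", dlerror());
    return 1;
  }
  return 0;
}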


On Thu, Jul 2, 2020, at 2:43 PM, Maarten Breddels wrote:
> Ok, thanks!
> 
> I'm setting up a repo with an example here, using pybind11:
> https://github.com/vaexio/vaex-arrow-ext
> 
> and I'll just try all possible combinations and report back.
> 
> cheers,
> 
> Maarten Breddels
> Software engineer / consultant / data scientist
> Python / C++ / Javascript / Jupyter
> www.maartenbreddels.com / vaex.io
> maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
> [image: Twitter] [image: Github]
> [image: LinkedIn]
> [image: Skype]
> 
> 
> 
> 
> Op do 2 jul. 2020 om 14:32 schreef Joris Van den Bossche <
> jorisvandenboss...@gmail.com>:
> 
> > Also no concrete answer, but one such example is turbodbc, I think.
> > But it seems they only have conda binary packages, and don't
> > distribute wheels ..
> > (https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html),
> > so not that relevant as comparison (they also need to build against an
> > odbc driver in addition to arrow).
> > But maybe Uwe has some more experience in this regard (and with
> > attempts building wheels for turbodbc, eg
> > https://github.com/blue-yonder/turbodbc/pull/108).
> >
> > Joris
> >
> > On Thu, 2 Jul 2020 at 11:05, Antoine Pitrou  wrote:
> > >
> > >
> > > Hi Maarten,
> > >
> > > Le 02/07/2020 à 10:53, Maarten Breddels a écrit :
> > > >
> > > > Also, I see pyarrow distributes manylinux1/2010/2014 wheels. Would a
> > vaex
> > > > extension distributed as a 2010 wheel, and build with the pyarrow 2010
> > > > wheel, work in an environment where someone installed a pyarrow 2014
> > > > wheel, or build from source, or installed from conda-forge?
> > >
> > > I have no idea about the concrete answer, but it probably depends
> > > whether the libstdc++ ABI changed between those two versions.  I'm
> > > afraid you'll have to experiment yourself.
> > >
> > > (if you want to eschew C++ ABI issues, you may use the C Data Interface:
> > > https://arrow.apache.org/docs/format/CDataInterface.html
> > > though of course you won't have access to all the useful helpers in the
> > > Arrow C++ library)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >
>


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-30 Thread Uwe L. Korn
I'm also in favor of disabling support for now. Having to deal with broken 
files or with detecting various incompatible implementations in the long term 
will do more harm than not supporting LZ4 for a while. Snappy is generally more 
widely used than LZ4 in this category, as it has been available since the 
inception of Parquet, and should thus be considered a viable alternative.

Cheers
Uwe

On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
> >
> >
> > Le 25/06/2020 à 00:02, Wes McKinney a écrit :
> > > hi folks,
> > >
> > > (cross-posting to dev@arrow and dev@parquet since there are
> > > stakeholders in both places)
> > >
> > > It seems there are still problems at least with the C++ implementation
> > > of LZ4 compression in Parquet files
> > >
> > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > https://issues.apache.org/jira/browse/PARQUET-1878
> >
> > I don't have any particular opinion on how to solve the LZ4 issue, but
> > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > compression algorithms available, and they span different parts of the
> > speed/compression spectrum, so it would be a pity to disable one of them.
> 
> It's true, however I think it's worse to write LZ4-compressed files
> that cannot be read by other Parquet implementations (if that's what's
> happening as I understand it?). If we are indeed shipping something
> broken then we either should fix it or disable it until it can be
> fixed.
> 
> > Regards
> >
> > Antoine.
>


Re: [VOTE] Permitting unsigned integers for Arrow dictionary indices

2020-06-30 Thread Uwe L. Korn
+1 (binding)

On Tue, Jun 30, 2020, at 6:24 AM, Wes McKinney wrote:
> +1 (binding)
> 
> On Mon, Jun 29, 2020 at 11:11 PM Ben Kietzman  
> wrote:
> >
> > +1 (non binding)
> >
> > On Mon, Jun 29, 2020, 18:00 Wes McKinney  wrote:
> >
> > > Hi,
> > >
> > > As discussed on the mailing list [1], it has been proposed to allow
> > > the use of unsigned dictionary indices (which is already technically
> > > possible in our metadata serialization, but not allowed according to
> > > the language of the columnar specification), with the following
> > > caveats:
> > >
> > > * Unless part of an application's requirements (e.g. if it is
> > > necessary to store dictionaries with size 128 to 255 more compactly),
> > > implementations are recommended to prefer signed over unsigned
> > > integers, with int32 continuing to be the "default" when the indexType
> > > field of DictionaryEncoding is null
> > > * uint64 dictionary indices, while permitted, are strongly not
> > > recommended unless required by an application as they are more
> > > difficult to work with in some programming languages (e.g. Java) and
> > > they do not offer the storage size benefits that uint8 and uint16 do.
> > >
> > > This change is backwards compatible, but not forward compatible for
> > > all implementations (for example, C++ will reject unsigned integers).
> > > Assuming that the V5 MetadataVersion change is accepted, to protect
> > > against forward compatibility issues such implementations would be
> > > recommended to not allow unsigned dictionary indices to be serialized
> > > using V4 MetadataVersion.
> > >
> > > A PR with the changes to the columnar specification (possibly subject
> > > to some clarifying language) is at [2].
> > >
> > > The vote will be open for at least 72 hours.
> > >
> > > [ ] +1 Accept changes to allow unsigned integer dictionary indices
> > > [ ] +0
> > > [ ] -1 Do not accept because...
> > >
> > > [1]:
> > > https://lists.apache.org/thread.html/r746e0a76c4737a2cf48dec656103677169bebb303240e62ae1c66d35%40%3Cdev.arrow.apache.org%3E
> > > [2]: https://github.com/apache/arrow/pull/7567
> > >
>
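
To make the storage argument in the proposal above concrete, a dictionary array
with unsigned 8-bit indices would be constructed roughly as follows in C++.
This is a sketch using the public builder APIs; exact signatures may differ
between Arrow versions:

#include <arrow/api.h>

// Sketch: a dictionary<values=utf8, indices=uint8> array with four entries
// referencing a two-value dictionary.
arrow::Result<std::shared_ptr<arrow::Array>> MakeUint8DictionaryExample() {
  auto type = arrow::dictionary(arrow::uint8(), arrow::utf8());

  arrow::UInt8Builder index_builder;
  ARROW_RETURN_NOT_OK(index_builder.AppendValues({0, 1, 1, 0}));
  std::shared_ptr<arrow::Array> indices;
  ARROW_RETURN_NOT_OK(index_builder.Finish(&indices));

  arrow::StringBuilder dict_builder;
  ARROW_RETURN_NOT_OK(dict_builder.AppendValues({"low", "high"}));
  std::shared_ptr<arrow::Array> dictionary;
  ARROW_RETURN_NOT_OK(dict_builder.Finish(&dictionary));

  // With uint8 indices, dictionaries of up to 256 entries can be addressed
  // with a single byte per value.
  return arrow::DictionaryArray::FromArrays(type, indices, dictionary);
}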


Re: [VOTE] Increment MetadataVersion in Schema.fbs from V4 to V5 for 1.0.0 release

2020-06-30 Thread Uwe L. Korn
+1 (binding)

On Tue, Jun 30, 2020, at 11:11 AM, Neville Dipale wrote:
> +1 (non-binding)
> 
> On Tue, 30 Jun 2020 at 06:29, Ben Kietzman  wrote:
> 
> > +1 (non binding)
> >
> > On Tue, Jun 30, 2020, 00:25 Wes McKinney  wrote:
> >
> > > +1 (binding)
> > >
> > > On Mon, Jun 29, 2020 at 10:49 PM Micah Kornfield 
> > > wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > On Mon, Jun 29, 2020 at 2:43 PM Wes McKinney 
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > As discussed on the mailing list [1], in order to demarcate the
> > > > > pre-1.0.0 and post-1.0.0 worlds, and to allow the
> > > > > forward-compatibility-protection changes we are making to actually
> > > > > work (i.e. so that libraries can recognize that they have received
> > > > > data with a feature that they do not support), I have proposed to
> > > > > increment the MetadataVersion from V4 to V5. Additionally, if the
> > > > > union validity bitmap changes are accepted, the MetadataVersion could
> > > > > be used to control whether unions are permitted to be serialized or
> > > > > not (with V4 -- used by v0.8.0 to v0.17.1, unions would not be
> > > > > permitted).
> > > > >
> > > > > Since there have been no backward incompatible changes to the Arrow
> > > > > format since 0.8.0, this would be no different, and (aside from the
> > > > > union issue) libraries supporting V5 are expected to accept BOTH V4
> > > > > and V5 so that backward compatibility is not broken, and any
> > > > > serialized data from prior versions of the Arrow libraries (0.8.0
> > > > > onward) will continue to be readable.
> > > > >
> > > > > Implementations are recommended, but not required, to provide an
> > > > > optional "V4 compatibility mode" for forward compatibility
> > > > > (serializing data from >= 1.0.0 that needs to be readable by older
> > > > > libraries, e.g. Spark deployments stuck on an older Java-Arrow
> > > > > version). In this compatibility mode, non-forward-compatible features
> > > > > added in 1.0.0 and beyond would not be permitted.
> > > > >
> > > > > A PR with the changes to Schema.fbs (possibly subject to some
> > > > > clarifying changes to the comments) is at [2].
> > > > >
> > > > > Once the PR is merged, it will be necessary for implementations to be
> > > > > updated and tested as appropriate at minimum to validate that
> > backward
> > > > > compatibility is preserved (i.e. V4 IPC payloads are still readable
> > --
> > > > > we have some in apache/arrow-testing and can add more as needed).
> > > > >
> > > > > The vote will be open for at least 72 hours.
> > > > >
> > > > > [ ] +1 Accept addition of MetadataVersion::V5 along with its general
> > > > > implications above
> > > > > [ ] +0
> > > > > [ ] -1 Do not accept because...
> > > > >
> > > > > [1]:
> > > > >
> > >
> > https://lists.apache.org/thread.html/r856822cc366d944b3ecdf32c2ea9b1ad8fc9d12507baa2f2840a64b6%40%3Cdev.arrow.apache.org%3E
> > > > > [2]: https://github.com/apache/arrow/pull/7566
> > > > >
> > >
> >
>


Re: [DISCUSS][C++] Performance work and compiler standardization for linux

2020-06-23 Thread Uwe L. Korn
FTR: We can use the latest(!) clang on all platforms for conda and wheels. It 
probably isn't even that complicated a setup. 

On Mon, Jun 22, 2020, at 5:42 PM, Francois Saint-Jacques wrote:
> We should aim to improve the performance of the most widely used
> *default* packages, which are python pip, python conda and R (all
> platforms). AFAIK, both pip (manywheel) and conda use gcc on Linux by
> default. R uses gcc on Linux and mingw (gcc) on Windows. I suppose
> (haven't checked) that clang is used on OSX via brew. Thus, by
> default, almost all users are going to use a gcc compiled version of
> arrow on Linux.
> 
> François
> 
> On Mon, Jun 22, 2020 at 9:47 AM Wes McKinney  wrote:
> >
> > Based on some of my performance work recently, I'm growing
> > uncomfortable with using gcc as the performance baseline since the
> > results can be significantly different (sometimes 3-4x or more on
> > certain fast algorithms) from clang and MSVC. The perf results on
> > https://github.com/apache/arrow/pull/7506 were really surprising --
> > some benchmarks that showed 2-5x performance improvement on both clang
> > and MSVC shows small regressions (20-30%) with gcc.
> >
> > I don't think we need a hard-and-fast rule about whether to accept PRs
> > based on benchmarks but there are a few guiding criteria:
> >
> > * How much binary size does the new code add? I think many of us would
> > agree that a 20% performance increase on some algorithm might not be
> > worth adding 500KB to libarrow.so
> > * Is the code generally faster across the major compiler targets (gcc,
> > clang, MSVC)?
> >
> > I think that using clang as a baseline for informational benchmarks
> > would be good, but ultimately we need to be systematically collecting
> > data on all the major compiilers. Some time ago I proposed building a
> > Continuous Benchmarking framework
> > (https://github.com/conbench/conbench/blob/master/doc/REQUIREMENTS.md)
> > for use with Arrow (and outside of Arrow, too) so I hope that this
> > will be able to help.
> >
> > - Wes
> >
> > On Mon, Jun 22, 2020 at 5:12 AM Yibo Cai  wrote:
> > >
> > > On 6/22/20 5:07 PM, Antoine Pitrou wrote:
> > > >
> > > > Le 22/06/2020 à 06:27, Micah Kornfield a écrit :
> > > >> There has been significant effort recently trying to optimize our C++
> > > >> code.  One  thing that seems to come up frequently is different 
> > > >> benchmark
> > > >> results between GCC and Clang.  Even different versions of the same
> > > >> compiler can yield significantly different results on the same code.
> > > >>
> > > >> I would like to propose that we choose a specific compiler and version 
> > > >> on
> > > >> Linux for evaluating performance related PRs.  PRs would only be 
> > > >> accepted
> > > >> if they improve the benchmarks under the selected version.
> > > >
> > > > Would this be a hard rule or just a guideline?  There are many ways in
> > > > which benchmark numbers can be improved or deteriorated by a PR, and in
> > > > some cases that doesn't matter (benchmarks are not always realistic, and
> > > > they are not representative of every workload).
> > > >
> > >
> > > I agree that microbenchmark is not always useful, focusing too much on
> > > improving microbenchmark result gives me feeling of "overfit" (to some
> > > specific microarchitecture, compiler, or use case).
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
>


Re: [DISCUSS][C++] Performance work and compiler standardization for linux

2020-06-22 Thread Uwe L. Korn
With my conda-forge background, I would suggest using clang as a performance 
baseline, because it's currently the only compiler that works reliably on all 
platforms. Most Linux distributions are nowadays built with gcc, which is also 
a strong argument, but on OSX and Windows the picture is a bit different.

Linux: One can use gcc or clang; as long as you use the right flags, they are 
interchangeable.
macOS: There are gcc builds, but recently I have seen people solely using 
clang, either the one coming with Xcode or a newer build.
Windows (MSVC-toolchain): You can use MSVC or clang; while most people prefer 
MSVC, clang is well-tested (e.g. Chrome uses it).
Windows (MSYS2-toolchain): Mostly uses GCC but clang is also supported. There 
were some informal mentions that clang performs better here when using AVX2.

Cheers
Uwe 


On Mon, Jun 22, 2020, at 6:27 AM, Micah Kornfield wrote:
> There has been significant effort recently trying to optimize our C++
> code.  One  thing that seems to come up frequently is different benchmark
> results between GCC and Clang.  Even different versions of the same
> compiler can yield significantly different results on the same code.
> 
> I would like to propose that we choose a specific compiler and version on
> Linux for evaluating performance related PRs.  PRs would only be accepted
> if they improve the benchmarks under the selected version. Other
> performance related PRs would still be acceptable if they improve
> benchmarks for a different compiler as long as they don't make the primary
> one worse and don't hurt maintainability of the code.  I also don't think
> this should change our CI matrix in any way.
> 
> There are other variables that potentially come into play for evaluating
> benchmarks (architecture, OS Version, etc) but if we need to limit these in
> a similar way, I think they should be handled as separate discussions.
> 
> I'm not clear on the limitations that our distributions place on compilers,
> but ideally we'd pick a compiler that can produce most/all of our binary
> artifacts.
> 
> Could someone more familiar with the release toolchain comment on good
> options to consider?
> 
> Any other criteria we should consider? Other thoughts?
> 
> Thanks,
> Micah
>


[C++] Kernels with scalar input

2020-06-17 Thread Uwe L. Korn
Hello all,

I'm trying to implement a `contains` kernel that takes as input a StringArray 
and a scalar string (see 
https://issues.apache.org/jira/browse/ARROW-9160). I feel confident with the 
rest of the new Kernels setup, but I didn't find an example kernel where we 
also pass in a scalar attribute. Can someone point me to an approach for doing 
this?

Best
Uwe
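
As a rough illustration (not an answer about the kernels framework itself), a
standalone version of such a contains operation over a StringArray and a scalar
pattern, written only against the public builder APIs, could look roughly like
the sketch below; in the real kernel the pattern would presumably travel in an
options struct rather than as a plain function argument.

#include <arrow/api.h>
#include <string>

// Sketch: element-wise "contains" of a scalar pattern over a StringArray.
// Nulls propagate; this bypasses the new kernels machinery entirely.
arrow::Result<std::shared_ptr<arrow::Array>> Contains(
    const arrow::StringArray& input, const std::string& pattern) {
  arrow::BooleanBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(input.length()));
  for (int64_t i = 0; i < input.length(); ++i) {
    if (input.IsNull(i)) {
      ARROW_RETURN_NOT_OK(builder.AppendNull());
    } else {
      ARROW_RETURN_NOT_OK(builder.Append(
          input.GetString(i).find(pattern) != std::string::npos));
    }
  }
  std::shared_ptr<arrow::Array> result;
  ARROW_RETURN_NOT_OK(builder.Finish(&result));
  return result;
}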


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Uwe L. Korn


On Fri, Jun 5, 2020, at 3:13 PM, Rémi Dettai wrote:
> Hi Antoine !
> > I would indeed have expected jemalloc to do that (remap the pages)
> I have no idea about the performance gain this would provide (if any).
> Could be interesting to explore.

This would actually be the most interesting thing. In general, getting access 
to the pages mapped into RAM would help in a lot more situations, not just 
reallocation. For example, when you take a small slice of a large array and 
only pass this on, but don't keep an explicit reference to the array, you will 
still indirectly hold on to the larger memory block. Having an allocator that 
understands the mapping between pages and memory blocks would allow us to free 
the pages that are not part of the view.

Also, yes: for CSV and JSON, we don't have size estimates beforehand, so this 
would be a great performance improvement there.

Best
Uwe


Re: [DISCUSS] [C++] custom allocator for large objects

2020-06-05 Thread Uwe L. Korn
Hello Rémi,

under the hood jemalloc does quite similar things to what you describe. I'm 
not sure what the cutoff is in the current version, but in earlier releases it 
used a different allocation strategy for objects above 4MB. For the initial 
large allocation, you will see quite a few copies, as mmap returns a new base 
address and it isn't able to reuse an existing space. This could probably be 
circumvented by a single large allocation which is freed again. 

As your suggestions don't seem to be specific to Arrow, why not contribute 
them directly to jemalloc? They are much better at reviewing allocator code 
than we are.

Still, when we read a column, we should be able to determine its final size 
from the Parquet metadata. Maybe we're not passing that information along?

Best,
Uwe

On Thu, Jun 4, 2020, at 5:48 PM, Rémi Dettai wrote:
> When creating large arrays, Arrow uses realloc quite intensively.
> 
> I have an example where I read a gzipped parquet column (strings) that
> expands from 8MB to 100+MB when loaded into Arrow. Of course Jemalloc
> cannot anticipate this and every reallocate call above 1MB (the most
> critical ones) ends up being a copy.
> 
> I think that knowing that we like using realloc in Arrow, we could come up
> with an allocator for large objects that would behave a lot better than
> Jemalloc. For smaller objects, this allocator could just let the memory
> request being handled by Jemalloc. Not trying to outsmart the brilliant
> guys from Facebook and co ;-) But for larger objects, we could adopt a
> custom strategy:
> - if an allocation or a re-allocation larger than 1MB (or maybe even 512K)
> is made on our memory pool, call mmap with size XGB (X being slightly
> smaller than the total physical memory on the system). This is ok because
> mmap will not physically allocate this memory as long as it is not touched.
> - we keep track of all allocations that we created this way, by storing the
> pointer + the actual used size inside our XGB alloc in a map.
> - when growing an alloc mmaped this way we will always have contiguous
> memory available, (otherwise we would already have OOMed because X is the
> physical memory size).
> - when reducing the alloc size we can free with madvice (optional: if the
> alloc becomes small enough, we might copy it back into a Jemalloc
> allocation).
> 
> I am not an expert of these matters, and I just learned what an allocator
> really is, so my approach might be naive. In this case feel free ton
> enlighten me!
> 
> Please note that I'm not sure about the level of portability of this
> solution.
> 
> Have a nice day!
> 
> Remi
>
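
For reference, the "reserve a huge range up front, grow in place, give pages
back with madvise" idea discussed in this thread maps onto the POSIX APIs
roughly as below. This is a Linux-oriented sketch with simplified error
handling and no page-size alignment, and it is not how Arrow's memory pool is
actually implemented:

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// A large allocation backed by a big anonymous mapping. Pages are only
// committed by the kernel when first touched, so the reservation itself is
// nearly free.
struct LargeAllocation {
  uint8_t* base = nullptr;  // start of the reserved range
  size_t reserved = 0;      // size of the virtual reservation
  size_t used = 0;          // bytes handed out to the caller so far
};

bool Reserve(LargeAllocation* a, size_t reserve_bytes) {
  void* p = mmap(nullptr, reserve_bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return false;
  a->base = static_cast<uint8_t*>(p);
  a->reserved = reserve_bytes;
  a->used = 0;
  return true;
}

// Growing is pure bookkeeping: no copy is needed as long as the new size
// stays within the reservation.
bool Grow(LargeAllocation* a, size_t new_size) {
  if (new_size > a->reserved) return false;
  a->used = new_size;
  return true;
}

// Shrinking returns the tail pages to the kernel without unmapping the range
// (real code would round the addresses to page boundaries first).
void Shrink(LargeAllocation* a, size_t new_size) {
  if (new_size < a->used) {
    madvise(a->base + new_size, a->used - new_size, MADV_DONTNEED);
    a->used = new_size;
  }
}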


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-05-30-0

2020-05-30 Thread Uwe L. Korn
https://github.com/apache/arrow/pull/7305 should enable us to upload conda 
packages again.

On Sat, May 30, 2020, at 12:10 PM, Crossbow wrote:
> 
> Arrow Build Report for Job nightly-2020-05-30-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0
> 
> Failed Tasks:
> - conda-linux-gcc-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-linux-gcc-py37
> - conda-linux-gcc-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-linux-gcc-py38
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-win-vs2015-py37
> - conda-win-vs2015-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-azure-conda-win-vs2015-py38
> - homebrew-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-homebrew-cpp
> - homebrew-r-autobrew:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-homebrew-r-autobrew
> - test-conda-python-3.7-dask-latest:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.7-dask-latest
> - test-conda-python-3.7-spark-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.7-spark-master
> - test-conda-python-3.8-dask-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.8-dask-master
> - test-conda-python-3.8-jpype:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.8-jpype
> 
> Succeeded Tasks:
> - centos-6-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-centos-8-amd64
> - debian-buster-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-debian-buster-amd64
> - debian-buster-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-debian-buster-arm64
> - debian-stretch-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-debian-stretch-amd64
> - debian-stretch-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-debian-stretch-arm64
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-travis-gandiva-jar-xenial
> - nuget:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-nuget
> - test-conda-cpp-valgrind:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-cpp-valgrind
> - test-conda-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-cpp
> - test-conda-python-3.6-pandas-0.23:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.6-pandas-0.23
> - test-conda-python-3.6:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.6
> - test-conda-python-3.7-hdfs-2.9.2:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-30-0-github-test-conda-python-3.7-hdfs-2.9.2
> - 

Re: Arrow sync all at 12pm US-Eastern / 16:00 UTC

2020-05-27 Thread Uwe L. Korn
No, we are just talking about removing static libraries from conda-forge that 
may be (/have been) used as part of the Arrow build. This shouldn't affect any 
non-conda Arrow users/developers.

Cheers,
Uwe

On Wed, May 27, 2020, at 6:53 PM, Rémi Dettai wrote:
> @Uwe: Just a quick question about the static build, I'm not sure I
> understood correctly: are we talking about removing the install step for
> the static libraries or the arrow_static target as a whole?
> 
> Le mer. 27 mai 2020 à 18:34, Neal Richardson 
> a écrit :
> 
> > Attendees:
> > Mahmut Bulut
> > Projjal Chanda
> > Rémi Dettai
> > Laurent Goujon
> > Andy Grove
> > Uwe Korn
> > Micah Kornfield
> > Wes McKinney
> > Rok Mihevc
> > Neal Richardson
> > François Saint-Jacques
> >
> > Discussion:
> > * patch queue is growing, please review things
> > * 1.0
> >   * Timeline: targeting July 1
> >   * Desire to add forward compatibility changes to format
> >   * Documentation: opportunity to reposition website, add user guides
> >   * Integration testing: now in Rust, but questions about which tests are
> > running/passing
> > * Adding extra checks to Rust for undefined behavior
> > * Conda: question about static library usage; also heads up that they're
> > consolidating the recipes
> >
> > On Wed, May 27, 2020 at 8:03 AM Wes McKinney  wrote:
> >
> > > The usual biweekly call will be held at
> > >
> > > https://meet.google.com/vtm-teks-phx
> > >
> > > All are welcome. Meeting notes will be posted to the mailing list
> > afterword
> > >
> >
>


Re: [NIGHTLY] Arrow Build Report for Job nightly-2020-05-26-0

2020-05-26 Thread Uwe L. Korn
The conda builds are failing as we have exceeded the storage available for our 
conda repository:

You currently have 3 public packages and 0 packages that require to be 
authenticated.
Using 10.0 GB of 3.0 GB storage

I guess we need something that deletes old builds automatically.
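
A rough sketch of what an automated cleanup could look like, assuming we drive
the anaconda-client CLI ("anaconda remove", with --force to skip the
confirmation prompt) from a small script. The channel name, package name and
retention window below are made up for illustration:

    # prune_nightlies.py - hypothetical cleanup helper
    import subprocess
    from datetime import datetime, timedelta

    CHANNEL = "arrow-nightlies"   # hypothetical anaconda.org channel
    PACKAGE = "pyarrow"           # hypothetical package name
    KEEP_DAYS = 14                # hypothetical retention window

    def remove_old_versions(versions):
        """versions: list of (version_string, upload_datetime) pairs."""
        cutoff = datetime.utcnow() - timedelta(days=KEEP_DAYS)
        for version, uploaded in versions:
            if uploaded < cutoff:
                # Delete the old release from the channel.
                subprocess.check_call(
                    ["anaconda", "remove", "--force",
                     f"{CHANNEL}/{PACKAGE}/{version}"])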

On Tue, May 26, 2020, at 12:12 PM, Crossbow wrote:
> 
> Arrow Build Report for Job nightly-2020-05-26-0
> 
> All tasks: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0
> 
> Failed Tasks:
> - conda-linux-gcc-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-linux-gcc-py36
> - conda-linux-gcc-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-linux-gcc-py37
> - conda-linux-gcc-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-linux-gcc-py38
> - conda-osx-clang-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-osx-clang-py36
> - conda-osx-clang-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-osx-clang-py37
> - conda-osx-clang-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-osx-clang-py38
> - conda-win-vs2015-py36:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-win-vs2015-py36
> - conda-win-vs2015-py37:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-win-vs2015-py37
> - conda-win-vs2015-py38:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-azure-conda-win-vs2015-py38
> - homebrew-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-homebrew-cpp
> - homebrew-r-autobrew:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-homebrew-r-autobrew
> - test-conda-cpp-valgrind:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-test-conda-cpp-valgrind
> - test-conda-python-3.7-dask-latest:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-test-conda-python-3.7-dask-latest
> - test-conda-python-3.7-spark-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-test-conda-python-3.7-spark-master
> - test-conda-python-3.8-dask-master:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-test-conda-python-3.8-dask-master
> - ubuntu-focal-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-ubuntu-focal-arm64
> - ubuntu-xenial-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-ubuntu-xenial-arm64
> 
> Succeeded Tasks:
> - centos-6-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-centos-6-amd64
> - centos-7-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-centos-7-aarch64
> - centos-7-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-centos-7-amd64
> - centos-8-aarch64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-centos-8-aarch64
> - centos-8-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-centos-8-amd64
> - debian-buster-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-debian-buster-amd64
> - debian-buster-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-debian-buster-arm64
> - debian-stretch-amd64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-debian-stretch-amd64
> - debian-stretch-arm64:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-debian-stretch-arm64
> - gandiva-jar-osx:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-gandiva-jar-osx
> - gandiva-jar-xenial:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-travis-gandiva-jar-xenial
> - nuget:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-nuget
> - test-conda-cpp:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-test-conda-cpp
> - test-conda-python-3.6-pandas-0.23:
>   URL: 
> https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-05-26-0-github-test-conda-python-3.6-pandas-0.23
> - test-conda-python-3.6:
>   URL: 
> 

Re: Arrow Flight connector for SQL Server

2020-05-21 Thread Uwe L. Korn
Hello Brendan,

welcome to the community. In addition to the folks at Dremio, I wanted to make 
you aware of the Python ODBC client library 
https://github.com/blue-yonder/turbodbc which provides a high-performance 
ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as the 
fastest known way to retrieve query results as DataFrames in Python from SQL 
Server, considerably faster than pandas.read_sql or using pyodbc directly.

While it is the fastest known option, I can tell that there is still a lot of 
CPU time spent in the ODBC driver "transforming" results so that they match the 
ODBC interface. At least here, one could possibly get much better performance 
when retrieving large columnar results from SQL Server by going through Arrow 
Flight as an interface instead of being constrained to the less efficient ODBC 
for this use case. Currently there is a performance difference of 50x between 
reading the data from a Parquet file and reading the same data from a table in 
SQL Server (simple SELECT, no filtering or similar). As the client CPU is at 
100% for nearly the full retrieval time, using a more efficient protocol for 
data transfer could roughly translate into a 10x speedup.
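
For reference, a minimal sketch of that turbodbc/Arrow path (the DSN and the
query are placeholders, not a real setup):

    from turbodbc import connect

    connection = connect(dsn="MSSQLServer")     # placeholder ODBC data source
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM some_table")  # placeholder query
    table = cursor.fetchallarrow()              # result set as a pyarrow.Table
    df = table.to_pandas()                      # optional: convert to pandas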

Best,
Uwe

On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> Hi everyone,
> 
> I wanted to informally introduce myself. My name is Brendan Niebruegge, 
> I'm a Software Engineer in our SQL Server extensibility team here at 
> Microsoft. I am leading an effort to explore how we could integrate 
> Arrow Flight with SQL Server. We think this could be a very interesting 
> integration that would both benefit SQL Server and the Arrow community. 
> We are very early in our thoughts so I thought it best to reach out 
> here and see if you had any thoughts or suggestions for me. What would 
> be the best way to socialize my thoughts to date? I am keen to learn 
> and deepen my knowledge of Arrow as well so please let me know how I 
> can be of help to the community.
> 
> Please feel free to reach out anytime (email:brn...@microsoft.com)
> 
> Thanks,
> Brendan Niebruegge
> 
>


Re: [VOTE] Release Apache Arrow 0.17.1 - RC1

2020-05-19 Thread Uwe L. Korn
Current status:

1.  [done] rebase (not required for a patch release)
2.  [done] upload source
3.  [done] upload binaries
4.  [done|in-pr] update website
5.  [done] upload ruby gems
6.  [ ] upload js packages
8.  [done] upload C# packages
9.  [ ] upload rust crates
10. [done] update conda recipes (dropped ppc64le support though)
11. [done] upload wheels to pypi
12. [nealrichardson] update homebrew packages
13. [done] update maven artifacts
14. [done|in-pr] update msys2
15. [nealrichardson] update R packages
16. [done|in-pr] update docs

On Tue, May 19, 2020, at 12:06 AM, Krisztián Szűcs wrote:
> Current status:
> 
> 1.  [done] rebase (not required for a patch release)
> 2.  [done] upload source
> 3.  [done] upload binaries
> 4.  [done|in-pr] update website
> 5.  [done] upload ruby gems
> 6.  [ ] upload js packages
> 8.  [done] upload C# packages
> 9.  [ ] upload rust crates
> 10. [in-progress|in-pr] update conda recipes
> 11. [done] upload wheels to pypi
> 12. [nealrichardson] update homebrew packages
> 13. [done] update maven artifacts
> 14. [done|in-pr] update msys2
> 15. [nealrichardson] update R packages
> 16. [done|in-pr] update docs
> 
> On Mon, May 18, 2020 at 11:33 PM Sutou Kouhei  wrote:
> >
> > >> 14. [ ] update msys2
> > >
> > > I'll do this.
> >
> > Oh, sorry. Krisztián already did!
> >
> > In <20200519.062731.1037230979568376433@clear-code.com>
> >   "Re: [VOTE] Release Apache Arrow 0.17.1 - RC1" on Tue, 19 May 2020 
> > 06:27:31 +0900 (JST),
> >   Sutou Kouhei  wrote:
> >
> > >> 14. [ ] update msys2
> > >
> > > I'll do this.
> > >
> > > In 
> > >   "Re: [VOTE] Release Apache Arrow 0.17.1 - RC1" on Mon, 18 May 2020 
> > > 22:37:50 +0200,
> > >   Krisztián Szűcs  wrote:
> > >
> > >> 1.  [done] rebase (not required for a patch release)
> > >> 2.  [done] upload source
> > >> 3.  [done] upload binaries
> > >> 4.  [done] update website
> > >> 5.  [done] upload ruby gems
> > >> 6.  [ ] upload js packages
> > >> No javascript changes were applied to the patch release, for
> > >> consistency we might want to choose to upload a 0.17.1 release though.
> > >> 8.  [done] upload C# packages
> > >> 9.  [ ] upload rust crates
> > >> @Andy Grove the patch release doesn't affect the rust implementation.
> > >> We can update the crates despite that no changes were made, not sure
> > >> what policy should we choose here (same as with JS)
> > >> 10. [ ] update conda recipes
> > >> @Uwe Korn seems like arrow-cpp-feedstock have not picked up the new
> > >> release once again
> > >> 11. [done] upload wheels to pypi
> > >> 12. [nealrichardson] update homebrew packages
> > >> 13. [done] update maven artifacts
> > >> 14. [ ] update msys2
> > >> 15. [nealrichardson] update R packages
> > >> 16. [in-progress] update docs
> > >>
> > >> On Mon, May 18, 2020 at 10:29 PM Krisztián Szűcs
> > >>  wrote:
> > >>>
> > >>> Current status:
> > >>>
> > >>> 1.  [done] rebase (not required for a patch release)
> > >>> 2.  [done] upload source
> > >>> 3.  [done] upload binaries
> > >>> 4.  [done] update website
> > >>> 5.  [ ] upload ruby gems
> > >>> 6.  [ ] upload js packages
> > >>> 8.  [ ] upload C# packages
> > >>> 9.  [ ] upload rust crates
> > >>> 10. [ ] update conda recipes
> > >>> 11. [done] upload wheels to pypi
> > >>> 12. [nealrichardson] update homebrew packages
> > >>> 13. [done] update maven artifacts
> > >>> 14. [ ] update msys2
> > >>> 15. [nealrichardson] update R packages
> > >>> 16. [in-progress] update docs
> > >>>
> > >>> On Mon, May 18, 2020 at 9:39 PM Neal Richardson
> > >>>  wrote:
> > >>> >
> > >>> > I'm working on the R stuff and can do Homebrew again.
> > >>> >
> > >>> > Neal
> > >>> >
> > >>> > On Mon, May 18, 2020 at 12:30 PM Krisztián Szűcs 
> > >>> > 
> > >>> > wrote:
> > >>> >
> > >>> > > Any help with the post release tasks is welcome!
> > >>> > >
> > >>> > > Checklist:
> > >>> > > 1.  [done] rebase (not required for a patch release)
> > >>> > > 2.  [done] upload source
> > >>> > > 3.  [in-progress] upload binaries
> > >>> > > 4.  [done] update website
> > >>> > > 5.  [ ] upload ruby gems
> > >>> > > 6.  [ ] upload js packages
> > >>> > > 8.  [ ] upload C# packages
> > >>> > > 9.  [ ] upload rust crates
> > >>> > > 10. [ ] update conda recipes
> > >>> > > 11. [kszucs] upload wheels to pypi
> > >>> > > 12. [ ] update homebrew packages
> > >>> > > 13. [kszucs] update maven artifacts
> > >>> > > 14. [ ] update msys2
> > >>> > > 15. [ ] update R packages
> > >>> > > 16. [in-progress] update docs
> > >>> > >
> > >>> > > @Neal Richardson I think you need to handle the R packages.
> > >>> > >
> > >>> > > On Mon, May 18, 2020 at 8:08 PM Krisztián Szűcs
> > >>> > >  wrote:
> > >>> > > >
> > >>> > > > The VOTE carries with 6 binding +1 votes and 1 non-binding +1 
> > >>> > > > vote.
> > >>> > > >
> > >>> > > > I'm starting the post release tasks and keep posted about the 
> > >>> > > > remaining
> > >>> > > tasks.
> > >>> > > >
> > >>> > > > Thanks everyone!
> > >>> > > >
> > >>> > > >
> > >>> > > > On 

Re: [Python] black vs. autopep8

2020-04-09 Thread Uwe L. Korn
The non-configurability of black is one of the strongest arguments I see for 
black. Code style will always be subjective. From previous discussions I know 
that my personal preference regarding readability conflicts with that of Antoine 
and Wes, and probably with that of others as well. We have the same issue with 
using the Google C++ style guide in C++. Not everyone agrees with it, but at 
least it is a guide that is widely adopted (and thus often seen in other 
projects) and has well-outlined reasoning for most of its choices.
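
To make the subjective part concrete, here is a small sketch of the kind of
rewriting black applies (written from memory, not generated by running the
tool):

    # Before:
    options = {'color': 'red',
               'size': 10}

    # After black: quotes normalized to double quotes and the literal collapsed
    # onto one line, because it fits within the line-length limit.
    options = {"color": "red", "size": 10}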

On Wed, Apr 8, 2020, at 10:25 PM, Xinbin Huang wrote:
> Another option that we can look into is yapf (
> https://github.com/google/yapf). It is similar to black but more tweakable.
> Also, it is recently adopted by the Apache Beam project. PR is here
> https://github.com/apache/beam/pull/10684/files
> 
> Bin
> 
> On Wed, Apr 8, 2020 at 1:18 PM Wes McKinney  wrote:
> 
> > I don't think it's possible unfortunately. From the README: "Black
> > reformats entire files in place. It is not configurable."
> >
> > The main concern about Black is the impact that it has on readability.
> > I share this concern as the subjective style choices it makes are
> > quite different from the way I've been writing Python code the last 12
> > years. That doesn't mean I'm right and it's wrong, of course. In this
> > project, it doesn't seem like we're losing much energy to people
> > reformatting arguments lists or adding line breaks here and there,
> > etc. FWIW, from reading the Black docs, it doesn't strike me that the
> > tool is designed to maximize readability, but rather to make
> > formatting deterministic and reduce code diffs caused by reformatting.
> >
> > My opinion is that we should simply avoid debating code style in code
> > reviews as long as the code passes the PEP8 checks. Employing autopep8
> > is an improvement over the status quo (which is that developers must
> > fix flake8 / PEP8 warnings by hand). We can always revisit the Black
> > discussion in the future if we find that subjective code formatting
> > (beyond PEP8 compliance) is taking up our energy inappropriately.
> >
> > On Wed, Apr 8, 2020 at 2:13 PM Rok Mihevc  wrote:
> > >
> > > Could we 'tone down' black to get the desired behavior?
> > > I'm ok with either tool.
> > >
> > > Rok
> > >
> > > On Wed, Apr 8, 2020 at 8:00 PM Wes McKinney  wrote:
> > >
> > > > On Wed, Apr 8, 2020 at 12:47 PM Neal Richardson
> > > >  wrote:
> > > > >
> > > > > So autopep8 doesn't fix everything? Sounds inferior to me. That
> > said, I'm
> > > > > in favor of any resolution that increases our automation of this and
> > > > > decreases the energy we expend debating it.
> > > >
> > > > It does fix everything, where "everything" is compliance with PEP8,
> > > > which I think is the thing we are most interested in.
> > > >
> > > > Black makes a bunch of other arbitrary (albeit consistent)
> > > > reformattings that don't affect PEP8 compliance.
> > > >
> > > > > Neal
> > > > >
> > > > >
> > > > > On Wed, Apr 8, 2020 at 10:34 AM Wes McKinney 
> > > > wrote:
> > > > >
> > > > > > Circling back on this, it seems there isn't consensus about
> > switching
> > > > > > to Black, and using autopep8 at least will give us an easy way to
> > > > > > maintain PEP8 compliance and help contributors fix linting failures
> > > > > > detected by flake8 (but not all, e.g. unused imports would need to
> > be
> > > > > > manually removed). Would everyone be on board with using autopep8?
> > > > > >
> > > > > > On Thu, Apr 2, 2020 at 9:07 AM Wes McKinney 
> > > > wrote:
> > > > > > >
> > > > > > > I'm personally fine with the Black changes. After the one-time
> > cost
> > > > of
> > > > > > > reformatting the codebase, it will take any personal preferences
> > out
> > > > > > > of code formatting (I admit that I have several myself, but I
> > don't
> > > > > > > mind the normalization provided by Black). I hope that Cython
> > support
> > > > > > > comes soon since a great deal of our code is Cython
> > > > > > >
> > > > > > > On Thu, Apr 2, 2020 at 9:00 AM Jacek Pliszka <
> > > > jacek.plis...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I believe amount of changes is not that important.
> > > > > > > >
> > > > > > > > In my opinion, what matters is which format will allow
> > reviewers
> > > > to be
> > > > > > > > more efficient.
> > > > > > > >
> > > > > > > > The committer can always reformat as they like. It is harder
> > for
> > > > the
> > > > > > reviewer.
> > > > > > > >
> > > > > > > > BR,
> > > > > > > >
> > > > > > > > Jacek
> > > > > > > >
> > > > > > > > czw., 2 kwi 2020 o 15:32 Antoine Pitrou 
> > > > > > napisał(a):
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > PS: in both cases, Cython files are not processed.  autopep8
> > is
> > > > > > actually
> > > > > > > > > able to process them, but the comparison wouldn't be
> > > > > > apples-to-apples.
> > > > > > > > >
> > > > > > > > > (that said, autopep8 

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

2020-04-07 Thread Uwe L. Korn
I did a bit more research on JIRA and we seem to already have this topic open 
there: https://issues.apache.org/jira/browse/ARROW-6959 covers a similar topic 
to this mail, and in https://issues.apache.org/jira/browse/ARROW-7009 we wanted 
to remove some of the interfaces with reference types.

On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> Hello all,
> 
> I'm in the progress of changing the implementation of the Take kernel 
> to work on ChunkedArrays without concatenating them into a single Array 
> first. While working on the implementation, I realised that we switch 
> often between Datum and the specific-typed parameters. This works quite 
> well for the combination of Array& and Datum(shared_ptr) as 
> here the reference object with type Array& always carries a shared 
> reference with it, so switching between Array& and its Datum is quite 
> easy.
> 
> In contrast, we cannot do this with ChunkedArrays as here the Datum 
> requires a shared_ptr which cannot be constructed from 
> the reference type. Thus to allow interfaces like `Status 
> Take(FunctionContext* ctx, const ChunkedArray& values, const Array& 
> indices,` to pass successfully their arguments to the Kernel 
> implementation, we have to do:
> 
> a) Remove the references from the interface of the Take() function and 
> use `shared_ptr` instances everywhere.
> b) Add interfaces to kernels like the TakeKernel that allow calling 
> with specific references instead of Datum instances
> 
> Personally I would prefer b) as this allow us to make more use of the 
> C++ type system and would also avoid the shared_ptr overhead where not 
> necessary.
> 
> Cheers,
> Uwe
>


[C++] Compute: Datum and "ChunkedArray&" inputs

2020-04-07 Thread Uwe L. Korn
Hello all,

I'm in the process of changing the implementation of the Take kernel to work 
on ChunkedArrays without concatenating them into a single Array first. While 
working on the implementation, I realised that we often switch between Datum 
and the specifically typed parameters. This works quite well for the combination 
of Array& and Datum(shared_ptr<Array>), as here the reference object with type 
Array& always carries a shared reference with it, so switching between Array& 
and its Datum is quite easy.

In contrast, we cannot do this with ChunkedArrays, as here the Datum requires a 
shared_ptr<ChunkedArray> which cannot be constructed from the reference type. 
Thus, to allow interfaces like `Status Take(FunctionContext* ctx, const 
ChunkedArray& values, const Array& indices,` to successfully pass their 
arguments to the kernel implementation, we have to do one of the following:

a) Remove the references from the interface of the Take() function and use 
`shared_ptr` instances everywhere.
b) Add interfaces to kernels like the TakeKernel that allow calling with 
specific references instead of Datum instances.

Personally I would prefer b), as this allows us to make more use of the C++ type 
system and would also avoid the shared_ptr overhead where it is not necessary.

Cheers,
Uwe


Re: Proposal to use Black for automatic formatting of Python code

2020-03-27 Thread Uwe L. Korn
I'm also very much in favor of this.

For the black / cython support, I think the current state is reflected in 
https://github.com/pablogsal/black/tree/cython.

On Fri, Mar 27, 2020, at 4:40 AM, Micah Kornfield wrote:
> +1 from me as well.
> 
> On Thursday, March 26, 2020, Neal Richardson 
> wrote:
> 
> > I'm also in favor, very much so. Life is too short to hold strong opinions
> > about code style; you get used to whatever you're accustomed to seeing. And
> > I support using automation to remove manual nuisances like this.
> >
> > Neal
> >
> > On Thu, Mar 26, 2020 at 3:49 PM Wes McKinney  wrote:
> >
> > > I'm in favor of this even though I also probably won't like some of
> > > the formatting decisions it makes. Is there a sense of how far away
> > > Black is from having Cython support? I saw it was being worked on a
> > > while back.
> > >
> > > On Thu, Mar 26, 2020 at 2:37 PM Joris Van den Bossche
> > >  wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I would like to propose adopting Black as code formatter within the
> > > python
> > > > project. There is an older JIRA issue about this (
> > > > https://issues.apache.org/jira/browse/ARROW-5176), but bringing it to
> > > the
> > > > mailing list for wider attention.
> > > >
> > > > Black (https://github.com/ambv/black) is a tool for automatically
> > > > formatting python code in ways which flake8 and our other linters
> > approve
> > > > of (and fill a similar role to clang-format for C++ and cmake-format
> > for
> > > > cmake). It can also be added to the linting checks on CI and to the
> > > > pre-commit hooks like we now run flake8.
> > > > Using it ensures python code will be formatted consistently, and more
> > > > importantly automates this formatting, letting you focus on more
> > > important
> > > > matters.
> > > >
> > > > Black makes some specific formatting choices, and not everybody (me
> > > > included) will always like those choices (that's how it goes with
> > > something
> > > > subjective like formatting). But my experience with using it in some
> > > other
> > > > big python projects (pandas, dask) has been very positive. You very
> > > quickly
> > > > get used to how it looks, while it is much nicer to not have to worry
> > > about
> > > > formatting anymore.
> > > >
> > > > Best,
> > > > Joris
> > >
> >
>


Re: [VOTE] Release Apache Arrow 0.16.0 - RC2

2020-02-05 Thread Uwe L. Korn
I'm failing to verify C++ on macOS as it seems that we nowadays pull all 
dependencies from the system. Is there a known way to build & test on OSX with 
the script and use conda for the requirements? 

Otherwise I probably need to invest some time to create such a way.

Cheers
Uwe

On Wed, Feb 5, 2020, at 2:54 AM, Krisztián Szűcs wrote:
> Hi,
> 
> I've cherry-picked the wheel fix [1] on top of the 0.16 release tag,
> re-built the wheels using crossbow [2], and uploaded them to
> bintray [3] (also removed win-py38m).
> 
> Anyone who has voted after verifying the wheels, please re-run
> the verification script again for the wheels and re-vote.
> 
> Thanks, Krisztian
> 
> [1] 
> https://github.com/apache/arrow/commit/67e34c53b3be4c88348369f8109626b4a8a997aa
> [2] https://github.com/ursa-labs/crossbow/branches/all?query=build-733
> [3] https://bintray.com/apache/arrow/python-rc/0.16.0-rc2#files
> 
> On Tue, Feb 4, 2020 at 7:08 PM Wes McKinney  wrote:
> >
> > +1 (binding)
> >
> > Some patches were required to the verification scripts but I have run:
> >
> > * Full source verification on Ubuntu 18.04
> > * Linux binary verification
> > * Source verification on Windows 10 (needed ARROW-6757)
> > * Windows binary verification. Note that Python 3.8 wheel is broken
> > (see ARROW-7755). Whoever uploads the wheels to PyPI _SHOULD NOT_
> > upload this 3.8 wheel until we know what's wrong (if we upload a
> > broken wheel then `pip install pyarrow==0.16.0` will be permanently
> > broken on Windows/Python 3.8)
> >
> > On Mon, Feb 3, 2020 at 9:26 PM Francois Saint-Jacques
> >  wrote:
> > >
> > > Tested on ubuntu 18.04 for the source release.
> > >
> > > On Mon, Feb 3, 2020 at 10:07 PM Francois Saint-Jacques
> > >  wrote:
> > > >
> > > > +1
> > > >
> > > > Binaries verification didn't have any issues.
> > > > Sources verification worked with some local environment hiccups
> > > >
> > > > François
> > > >
> > > > On Mon, Feb 3, 2020 at 8:46 PM Andy Grove  wrote:
> > > > >
> > > > > +1 (binding) based on running the Rust tests
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Thu, Jan 30, 2020 at 8:13 PM Krisztián Szűcs 
> > > > > 
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I would like to propose the following release candidate (RC2) of 
> > > > > > Apache
> > > > > > Arrow version 0.16.0. This is a release consisting of 728
> > > > > > resolved JIRA issues[1].
> > > > > >
> > > > > > This release candidate is based on commit:
> > > > > > 729a7689fd87572e6a14ad36f19cd579a8b8d9c5 [2]
> > > > > >
> > > > > > The source release rc2 is hosted at [3].
> > > > > > The binary artifacts are hosted at [4][5][6][7].
> > > > > > The changelog is located at [8].
> > > > > >
> > > > > > Please download, verify checksums and signatures, run the unit 
> > > > > > tests,
> > > > > > and vote on the release. See [9] for how to validate a release 
> > > > > > candidate.
> > > > > >
> > > > > > The vote will be open for at least 72 hours.
> > > > > >
> > > > > > [ ] +1 Release this as Apache Arrow 0.16.0
> > > > > > [ ] +0
> > > > > > [ ] -1 Do not release this as Apache Arrow 0.16.0 because...
> > > > > >
> > > > > > [1]:
> > > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%200.16.0
> > > > > > [2]:
> > > > > > https://github.com/apache/arrow/tree/729a7689fd87572e6a14ad36f19cd579a8b8d9c5
> > > > > > [3]: 
> > > > > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.16.0-rc2
> > > > > > [4]: https://bintray.com/apache/arrow/centos-rc/0.16.0-rc2
> > > > > > [5]: https://bintray.com/apache/arrow/debian-rc/0.16.0-rc2
> > > > > > [6]: https://bintray.com/apache/arrow/python-rc/0.16.0-rc2
> > > > > > [7]: https://bintray.com/apache/arrow/ubuntu-rc/0.16.0-rc2
> > > > > > [8]:
> > > > > > https://github.com/apache/arrow/blob/729a7689fd87572e6a14ad36f19cd579a8b8d9c5/CHANGELOG.md
> > > > > > [9]:
> > > > > > https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
> > > > > >
>


[Python] Exposing compute kernels

2019-12-17 Thread Uwe L. Korn
Hello all,

we have developed quite a few compute kernels in C++ by now and I would like 
to call them from Python. We could expose the kernels on the Array/ChunkedArray 
classes themselves or as standalone functions (or as both). What would be the 
preferred way?

Exposing them as standalone functions also raises the question of whether to 
have them in the top-level namespace (e.g. pyarrow.sum) or in a separate one 
(e.g. pyarrow.compute.sum).
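
To illustrate the two options from a user's perspective (just a sketch, the
function and method names are hypothetical, not an existing API):

    import pyarrow as pa

    arr = pa.array([1, 2, 3, None])

    # Option 1: standalone functions, either in a separate namespace ...
    # total = pa.compute.sum(arr)    # hypothetical
    # ... or in the top-level namespace:
    # total = pa.sum(arr)            # hypothetical

    # Option 2: methods on the Array/ChunkedArray classes themselves:
    # total = arr.sum()              # hypothetical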

Uwe


Re: Adding stronger warnings about pre-production Arrow IPC implementations (C#, Rust)

2019-11-22 Thread Uwe L. Korn
Hello Wes,

what about adding an implementation status table to the README of every 
language? Entries like "Supports Arrow File Format", "Supports Arrow Stream 
Format", "Passes IPC integration tests", and "Supports Flight" are interesting 
to users and show how far an implementation has progressed.
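
A rough sketch of what such a table could look like (the status values are made
up for illustration):

    Capability                   | Status
    ---------------------------- | ------
    Supports Arrow File Format   | yes
    Supports Arrow Stream Format | yes
    Passes IPC integration tests | no
    Supports Flight              | no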

Uwe

On Thu, Nov 21, 2019, at 5:33 PM, Wes McKinney wrote:
> hi folks,
> 
> We're accruing some bug reports relating to the C# library when it
> comes to interop with other languages
> 
> Nowhere in
> 
> https://github.com/apache/arrow/blob/master/csharp/README.md
> 
> is it clearly stated that such problems are to be anticipated.
> 
> Until C# participates in the integration tests as a first-class
> citizen I think we should insert a highly visible warning to not build
> any production applications depending on IPC-level interoperability
> (unless you're prepared to roll up your sleeves and debug/fix problems
> in the libraries). To be clear, it's good to have the bug reports, but
> we should also set expectations appropriately.
> 
> Note that this is not stated in the Rust README either, so it is
> probably a good idea to do this there, too.
> 
> - Wes
>


Re: [DISCUSS] Reviewing Arrow commit/code review policy

2019-10-14 Thread Uwe L. Korn
Hello all,

I also think we should stay with CTR for the moment. If we wanted to enforce 
RTC, or at least provide better notification for reviewers of certain parts of 
Arrow, we could set up a CODEOWNERS file [1] to automatically add the experts 
for a certain file/folder as reviewers on PRs on GitHub.
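
For illustration, a minimal sketch of such a file (the paths and usernames are
placeholders, not a concrete proposal):

    # .github/CODEOWNERS - hypothetical example
    # PRs touching these paths automatically request a review from the listed
    # GitHub users or teams.
    /cpp/      @some-cpp-expert
    /python/   @some-python-expert
    /rust/     @some-rust-expert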

Cheers
Uwe

[1] https://help.github.com/en/articles/about-code-owners

On Sun, Oct 13, 2019, at 2:03 AM, Andy Grove wrote:
> Wes,
> 
> Thanks for clarifying this. This will be very helpful for me while I work
> on the Rust DataFusion crate since we have a small number of committers
> today. I will still generally make PRs available for review (unless they
> are trivial changes) but being able to merge without review when the other
> committers are busy will be very helpful for momentum with some of the
> features that I would like to see in the 1.0.0 release.
> 
> Thanks,
> 
> Andy.
> 
> On Sat, Oct 12, 2019 at 2:51 PM Wes McKinney  wrote:
> 
> > hi folks,
> >
> > We've added many new committers to Apache Arrow over the last 3+
> > years, so I thought it would be worthwhile to review our commit and
> > code review policy for everyone's benefit.
> >
> > Since the beginning of the project, Arrow has been in "Commit Then
> > Review" mode (aka CTR).
> >
> > https://www.apache.org/foundation/glossary.html#CommitThenReview
> >
> > The idea of CTR is that committers can make changes at will with the
> > understanding that if there is some disagreement or if work is vetoed,
> > then changes may be reverted.
> >
> > In particular, in CTR if a committer submits a patch, they are able to
> > +1 and merge their own patch. Generally, though, as a matter of
> > courtesy to the community, for non-trivial patches it is a good idea
> > to allow time for code review.
> >
> > More mature projects, or ones with potentially contentious governance
> > / political issues, sometimes adopt "Review-Then-Commit" (RTC) which
> > requires a more structured sign-off process from other committers.
> > While Apache Arrow is more mature now, the diversity of the project
> > has resulted in a lot of spread-out code ownership. I think that RTC
> > at this stage would cause hardship for contributors on some components
> > where there are not a lot of active code reviewers.
> >
> > Personally, I am OK to stick with CTR until we start experiencing
> > problems. Overall I think we have a healthy dynamic amongst the
> > project's nearly 50 committers and we have had to revert patches
> > relatively rarely.
> >
> > Any thoughts from others?
> >
> > Thanks
> > Wes
> >
>


Re: [DISCUSS] C-level in-process array protocol

2019-10-08 Thread Uwe L. Korn
I'm not sure whether flatbuffers is actually an issue in the end, but keeping it 
out of the C API definitely simplifies adoption a bit. I don't think, though, 
that using protobuf would make a difference here.

In general, I really like the C-interface work, as sadly C APIs are still the 
most accessible ones. Even when using the official Arrow C++ library, I often 
want to access the underlying data with some other non-C++ processing library, 
and having the C interface makes my life easier. In my case I'm working with 
Numba (an LLVM-based JIT for a subset of numerical Python), which does not 
easily support interfacing with C++ but can work with C FFI calls directly.

Uwe

On Tue, Oct 8, 2019, at 8:54 PM, Jacques Nadeau wrote:
> I removing all my objections to this work.
> 
> I wish there was more feedback from additional community members. I
> continue to be concerned about fragmentation. I don't agree with the
> arguments here that we need to add a new api to make it easy for people to
> *not* use Arrow codebase. It seems like a punt on building useful libraries
> within the project that will ultimately hurt the interoperability story.
> 
> As a side note, it seems like much of this is about people's distaste for
> flatbuffers. I know I regret using it. If we had a chance to do it over
> again, I would have chosen to use protobuf for everything except the data
> header, where I would hand write the encoding (since it is so simple
> anyway). If it is such a problem that people are contorting to work around
> it, maybe we should address that? Just a thought.
> 
> Thanks for the discourse and patience.
> 
> On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield 
> wrote:
> 
> > Hi Wes,
> > I agree for third-parties "A" (Field data structures) is the most useful.
> >
> > At least in my mind the discussion was for both first and third-parties.  I
> > was trying to point out that "A" is less necessary as a first step for
> > first-party integrations and could potentially require more effort if we
> > already have the code that does "B" (field reassembly).
> >
> > Thanks,
> > Micah
> >
> > On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney  wrote:
> >
> > > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield 
> > > wrote:
> > > >
> > > > I've tried to summarize my understanding of the debate so far and give
> > > some
> > > > initial thoughts. I think there are two potentially different sets of
> > > users
> > > > that we are targeting with a stable C API/ABI ourselves and external
> > > > parties.
> > > >
> > > > 1.  Different language implementations within the Arrow project that
> > want
> > > > to call into each other's code.  We still don't have a great story
> > around
> > > > this in terms of reusable libraries and questions like [1] are a
> > > motivating
> > > > examples of making something better in this context.
> > > > 2.  third-parties wishing to support/integrate with Arrow.  Some
> > > > conjectures about these users:
> > > >   - Users in this group are NOT necessarily familiar with existing
> > > > technologies Arrow uses (i.e. flatbuffers)
> > > >   - The stability of the API is the primary concern (consumers don't
> > want
> > > > to change when a new version of the library ships)
> > > >   - An important secondary concern is additional libraries that need to
> > > be
> > > > integrated in addition to the API
> > > >
> > > > The main debate points seems to be:
> > > >
> > > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
> > > additional
> > > > column oriented API become too much of a maintenance headache/cause
> > > > fragmentation?
> > > >
> > > >  - In my mind the question here is which set of users we are
> > > prioritizing.
> > > > IMO the combination of flatbuffers and translation to/from RecordBatch
> > > > format offers too much friction to make it easy for a third-party
> > > > implementer to use. If we are prioritizing for our own internal
> > > use-cases I
> > > > think we should try out a RecordBatch+Flatbuffers based C-API. We
> > already
> > > > have all the necessary building blocks.
> > > >
> > >
> > > If a C function passes you a string containing a RecordBatch
> > > Flatbuffers message, what happens next? This message has to be
> > > reassembled into a recursive data structure before you can "do"
> > > anything with it. Are we expecting every third party project to
> > > implement:
> > >
> > > A. Data structures appropriate to represent a logical "field" in a
> > > record batch (which have to be recursive to account for nested types'
> > > children)
> > > B. The logic to convert from the flattened Flatbuffers representation
> > > to some implementation of A
> > >
> > > I'm arguing that we should provide both to third parties. To build B,
> > > you need A. Some consumers will only use A. This discussion is
> > > essentially about developing an ultraminimalist "drop-in" C
> > > implementation of A.
> > >
> > > > 2.  How onerous is the dependency on flat-buffers 

Re: [DRAFT] Apache Arrow Board Report - October 2019

2019-10-08 Thread Uwe L. Korn
I'm not sure what qualifies for "board attention" but it seems that CI is a 
critical problem in Apache projects, not just Arrow. Should we raise that?

Uwe

On Tue, Oct 8, 2019, at 12:00 AM, Wes McKinney wrote:
> Here is a start for our Q3 board report
> 
> ## Description:
> The mission of Apache Arrow is the creation and maintenance of software 
> related
> to columnar in-memory processing and data interchange
> 
> ## Issues:
> There are no issues requiring board attention at this time
> 
> ## Membership Data:
> * Apache Arrow was founded 2016-01-19 (4 years ago)
> * There are currently 48 committers and 28 PMC members in this project.
> * The Committer-to-PMC ratio is roughly 3:2.
> 
> Community changes, past quarter:
> - Micah Kornfield was added to the PMC on 2019-08-21
> - Sebastien Binet was added to the PMC on 2019-08-21
> - Ben Kietzman was added as committer on 2019-09-07
> - David Li was added as committer on 2019-08-30
> - Kenta Murata was added as committer on 2019-09-05
> - Neal Richardson was added as committer on 2019-09-05
> - Praveen Kumar was added as committer on 2019-07-14
> 
> ## Project Activity:
> 
> * The project has just made a 0.15.0 release.
> * We are discussing ways to make the Arrow libraries as accessible as possible
>   to downstream projects for minimal use cases while allowing the development
>   of more comprehensive "standard libraries" with larger dependency stacks in
>   the project
> * We plan to make a 1.0.0 release as our next major release, at which time we
>   will declare that the Arrow binary protocol is stable with forward and
>   backward compatibility guarantees
> * We are struggling with Continuous Integration scalability as the project has
>   definitely outgrown what Travis CI and Appveyor can do for us. We are
>   exploring alternative solutions such as Buildbot, Buildkite (see
>   INFRA-19217), and GitHub Actions to provide a path to migrate away from
>   Travis CI / Appveyor
> 
> ## Community Health:
> 
> * The community is overall healthy, with the aforementioned concerns around CI
>   scalability. New contributors frequently take notice of the long build queue
>   times when submitting pull requests.
>


Re: Collecting Arrow critique and our roadmap on that

2019-09-23 Thread Uwe L. Korn
Thanks for all the contributions that have already come in. I made some more 
additions and hope to turn this into a PR to the site soon.

Uwe

On Fri, Sep 20, 2019, at 10:46 AM, Micah Kornfield wrote:
> I think this is a good idea, as well.  I added comments and additions on
> the document.
> 
> On Thu, Sep 19, 2019 at 11:47 AM Neal Richardson <
> neal.p.richard...@gmail.com> wrote:
> 
> > Uwe, I think this is an excellent idea. I've started
> >
> > https://docs.google.com/document/d/1cgN7mYzH30URDTaioHsCP2d80wKKHDNs9f5s7vdb2mA/edit?usp=sharing
> > to collect some ideas and notes. Once we have gathered our thoughts
> > there, we can put them in the appropriate places.
> >
> > I think that some of the result will go into the FAQ, some into
> > documentation (maybe more "how-to" and "getting started" guides in the
> > respective language docs, as well as some "how to share Arrow data
> > from X to Y"), and other things that we haven't yet done should go
> > into a sort of Roadmap document on the main website. We have some very
> > outdated content related to a roadmap on the confluence wiki that
> > should be folded in as appropriate too.
> >
> > Neal
> >
> > On Thu, Sep 19, 2019 at 10:26 AM Uwe L. Korn  wrote:
> > >
> > > Hello,
> > >
> > > there has been a lot of public discussions lately with some mentions of
> > actually informed, valid critique of things in the Arrow project. From my
> > perspective, these things include "there is not STL-native C++ Arrow API",
> > "the base build requires too much dependencies", "the pyarrow package is
> > really huge and you cannot select single components". These are things we
> > cannot tackle at the moment due to the lack of contributors to the project.
> > But we can use this as a basis to point people that critique the project on
> > this that this is not intentional but a lack of resources as well as it
> > provides another point of entry for new contributors looking for work.
> > >
> > > Thus I would like to start a document (possibly on the website) where we
> > list the major critiques on Arrow, mention our long-term solution to that
> > and what JIRAs need to be done for that.
> > >
> > > Would that be something others would also see as valuable?
> > >
> > > There has also been a lot of uninformed criticism, I think that can be
> > best combat by documentation, blog posts and public appearances at
> > conferences and is not covered by this proposal.
> > >
> > > Uwe
> >
>


Collecting Arrow critique and our roadmap on that

2019-09-19 Thread Uwe L. Korn
Hello,

there have been a lot of public discussions lately with some mentions of 
genuinely informed, valid critique of things in the Arrow project. From my 
perspective, these things include "there is no STL-native C++ Arrow API", "the 
base build requires too many dependencies", and "the pyarrow package is really 
huge and you cannot select single components". These are things we cannot tackle 
at the moment due to the lack of contributors to the project. But we can use 
this as a basis to point out to people who critique the project on these grounds 
that this is not intentional but a lack of resources, and it also provides 
another point of entry for new contributors looking for work.

Thus I would like to start a document (possibly on the website) where we list 
the major critiques of Arrow and, for each, mention our long-term solution and 
the JIRAs that need to be done for it.

Would that be something others would also see as valuable?

There has also been a lot of uninformed criticism; I think that can best be 
countered by documentation, blog posts and public appearances at conferences, 
and it is not covered by this proposal.

Uwe


Re: [DISCUSS] C-level in-process array protocol

2019-09-19 Thread Uwe L. Korn
Hello,

I like this proposal, as it will make in-process interfacing between various 
libraries with Arrow support much easier. I'm a bit critical, though, of using 
a string as the format representation, as one needs to parse it correctly. 
Couldn't we use the enums we already have and reimplement them as C defines 
instead?

Uwe

On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
> Hi Antoine,
> 
> I'm also interested in a stable ABI (previously I posted on this mailing
> list about the ABI issues I had [1]). Does having such an ABI-stable
> C-struct imply that there will be a set of C-APIs exposed by the Arrow
> (C++) library (which I think would lead to a solution to all the inherit
> ABI issues caused by C++)?
> 
> [1]
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> 
> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou  wrote:
> 
> >
> > Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> > > I like the idea of a stable ABI for in-processing  that can be used for
> > in
> > > process communication.  For instance, there was a recent question on
> > > stack-overflow on how to solve this [1].
> > >
> > > A couple of thoughts/questions:
> > > * Would ArrowArray also need a self reference for children arrays?
> >
> > Yes, I forgot that.  I also think we don't need a separate Buffer
> > struct, instead the Array struct should own all its buffers.
> >
> > > * Should transferring key-value metadata be in scope?
> >
> > Yes.  It could either be in the format string or a separate string.  The
> > upside of a separate string is that a consumer may ignore it trivially
> > if it doesn't need the information.
> >
> > Another open question is for nested types: does the format string
> > represent the entire type including children?  Or must child types be
> > read in the child arrays?  If we mimick ArrayData, then the format
> > string should represent the entire type; it will then be more complex to
> > parse.
> >
> > We should also make sure that extension types fit in the protocol.
> >
> > > * Should the API more closely align the IPC spec (pass a schema
> > separately
> > > and list of buffers instead of individual arrays)?
> >
> > Then you have that's not immediately usable (you have to do some
> > processing to reconstitute the individual arrays).  One goal here is to
> > minimize implementation costs for producers and consumers.  The
> > assumption is a data model similar to the C++ ArrowData model; do we
> > have implementations that use an entirely different model?  Perhaps I
> > should take a look :-)
> >
> > Note that the draft I posted only concerns arrays.  We may also want to
> > have a C struct for batches or tables.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> > https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> > >
> > > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >> Hello,
> > >>
> > >> One thing that was discussed in the sync call is the ability to easily
> > >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> > >> libraries in the same process, without bearing the cost of linking to
> > >> e.g. the C++ Arrow library.
> > >>
> > >> (for example: "Duckdb wants to provide an option to return Arrow data of
> > >> result sets, but they don't like having Arrow as a dependency")
> > >>
> > >> One possibility would be to define a C-level protocol similar in spirit
> > >> to the Python buffer protocol, which some people may be familiar with
> > (*).
> > >>
> > >> The basic idea is to define a simple C struct, which is ABI-stable and
> > >> describes an Arrow away adequately.  The struct can be stack-allocated.
> > >> Its definition can also be copied in another project (or interfaced with
> > >> using a C FFI layer, depending on the language).
> > >>
> > >> There is no formal proposal, this message is meant to stir the
> > discussion.
> > >>
> > >> Issues to work out:
> > >>
> > >> * Memory lifetime issues: where Python simply associates the Py_buffer
> > >> with a PyObject owner (a garbage-collected Python object), we need
> > >> another means to control lifetime of pointed areas.  One simple
> > >> possibility is to include a destructor function pointer in the protocol
> > >> struct.
> > >>
> > >> * Arrow type representation.  We probably need some kind of "format"
> > >> mini-language to represent Arrow types, so that a type can be described
> > >> using a `const char*`.  Ideally, primitives types at least should be
> > >> trivially parsable.  We may take inspiration from Python here (`struct`
> > >> module format characters, PEP 3118 format additions).
> > >>
> > >> Example C struct definition (not a formal proposal!):
> > >>
> > >> struct ArrowBuffer {
> > >>   void* data;
> > >>   int64_t nbytes;
> > >>   // Called by the consumer when it 

Re: Build issues on macOS [newbie]

2019-09-19 Thread Uwe L. Korn
Hello Tarek,

this error message is normally the one you get when CONDA_BUILD_SYSROOT doesn't 
point to your 10.9 SDK. Please delete your build folder again and do `export 
CONDA_BUILD_SYSROOT=..` immediately before running cmake. Running e.g. a conda 
install will sadly reset this variable to something different and break the 
build.

As a side note: it looks like conda-forge will get rid of the SDK requirement 
in 1-2 months; then this will be a bit simpler.

Cheers
Uwe

On Thu, Sep 19, 2019, at 5:24 PM, Tarek Allam Jr. wrote:
> 
> Hi all,
> 
> Firstly I must apologies if what I put here is extremely trivial, but I am a
> complete newcomer to the Apache Arrow project and contributing to Apache in
> general, but I am very keen to get involved.
> 
> I'm hoping to help where I can so I recently attempted to complete a build
> following the instructions laid out in the 'Python Development' section of the
> documentation here:
> 
> After completing the steps that specifically uses Conda I was able to create 
> an
> environment but when it comes to building I am unable to do so.
> 
> I am on macOS -- 10.14.6 and as outlined in the docs and here 
> (https://stackoverflow.com/a/55798942/4521950) I used use 10.9.sdk 
> instead
> of the latest. I have both added this manually using ccmake and also 
> defining it
> like so:
> 
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>   -DCMAKE_INSTALL_LIBDIR=lib \
>   -DARROW_FLIGHT=ON \
>   -DARROW_GANDIVA=ON \
>   -DARROW_ORC=ON \
>   -DARROW_PARQUET=ON \
>   -DARROW_PYTHON=ON \
>   -DARROW_PLASMA=ON \
>   -DARROW_BUILD_TESTS=ON \
>   -DCONDA_BUILD_SYSROOT=/opt/MacOSX10.9.sdk \
>   -DARROW_DEPENDENCY_SOURCE=AUTO \
>   ..
> 
> But it seems that whatever I try, I seem to get errors, the main only tripping
> me up at the moment is:
> 
> -- Building using CMake version: 3.15.3
> -- The C compiler identification is Clang 4.0.1
> -- The CXX compiler identification is Clang 4.0.1
> -- Check for working C compiler: 
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang
> -- Check for working C compiler: 
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang -- broken
> CMake Error at 
> /usr/local/anaconda3/envs/pyarrow-dev/share/cmake-3.15/Modules/CMakeTestCCompiler.cmake:60
>  (message):
>   The C compiler
> 
> "/usr/local/anaconda3/envs/pyarrow-dev/bin/clang"
> 
>   is not able to compile a simple test program.
> 
>   It fails with the following output:
> 
> Change Dir: /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp
> 
> Run Build Command(s):/usr/local/bin/gmake cmTC_b252c/fast && 
> /usr/local/bin/gmake -f CMakeFiles/cmTC_b252c.dir/build.make 
> CMakeFiles/cmTC_b252c.dir/build
> gmake[1]: Entering directory 
> '/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
> Building C object CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang   -march=core2 
> -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE 
> -fstack-protector-strong -O2 -pipe  -isysroot /opt/MacOSX10.9.sdk   -o 
> CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o   -c 
> /Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp/testCCompiler.c
> Linking C executable cmTC_b252c
> /usr/local/anaconda3/envs/pyarrow-dev/bin/cmake -E 
> cmake_link_script CMakeFiles/cmTC_b252c.dir/link.txt --verbose=1
> /usr/local/anaconda3/envs/pyarrow-dev/bin/clang -march=core2 
> -mtune=haswell -mssse3 -ftree-vectorize -fPIC -fPIE 
> -fstack-protector-strong -O2 -pipe  -isysroot /opt/MacOSX10.9.sdk 
> -Wl,-search_paths_first -Wl,-headerpad_max_install_names -Wl,-pie 
> -Wl,-headerpad_max_install_names -Wl,-dead_strip_dylibs  
> CMakeFiles/cmTC_b252c.dir/testCCompiler.c.o  -o cmTC_b252c
> ld: warning: ignoring file 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
>  file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
> 0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
> architecture being linked (x86_64): 
> /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
> ld: dynamic main executables must link with libSystem.dylib for 
> architecture x86_64
> clang-4.0: error: linker command failed with exit code 1 (use -v to 
> see invocation)
> gmake[1]: *** [CMakeFiles/cmTC_b252c.dir/build.make:87: cmTC_b252c] 
> Error 1
> gmake[1]: Leaving directory 
> '/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeTmp'
> gmake: *** [Makefile:121: cmTC_b252c/fast] Error 2
> 
> 
>   CMake will not be able to correctly generate this project.
> Call Stack (most recent call first):
>   CMakeLists.txt:32 (project)
> 
> -- Configuring incomplete, errors occurred!
> See also "/Users/tallamjr/Github/arrow/cpp/build/CMakeFiles/CMakeOutput.log".
> See also 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Uwe L. Korn
Hello Micah,

I don't think we have explored using Bazel yet. I would see it as a possible 
modular alternative, but as you mention it would be a lot of work and we would 
probably need a mentor who is familiar with Bazel; otherwise we would probably 
end up spending too much time on this and get a non-typical Bazel setup.

Uwe

On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> It has come up in the past, but I wonder if exploring Bazel as a build
> system with its a very explicit dependency graph might help (I'm not sure
> if something similar is available in CMake).
> 
> This is also a lot of work, but could also potentially benefit the
> developer experience because we can make unit tests depend on individual
> compilable units instead of all of libarrow.  There are trade-offs here as
> well in terms of public API coverage.
> 
> On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:
> 
> > Hello,
> >
> > I can think of two other alternatives that make it more visible what Arrow
> > core is and what are the optional components:
> >
> > * Error out when no component is selected instead of building just the
> > core Arrow. Here we could add an explanative message that list all
> > components and for each component 2-3 words what it does and what it
> > requires. This would make the first-time experience much better.
> > * Split the CMake project into several subprojects. By correctly
> > structuring the CMakefiles, we should be able to separate out the Arrow
> > components into separate CMake projects that can be built independently if
> > needed while all using the same third-party toolchain. We would still have
> > a top-level CMakeLists.txt that is invoked just like the current one but
> > through having subprojects, you would not anymore be bound to use the
> > single top-level one. This would also have some benefit for packagers that
> > could separate out the build of individual Arrow modules. Furthermore, it
> > would also make it easier for PoC/academic projects to just take the Arrow
> > Core sources and drop it in as a CMake subproject; while this is not a good
> > solution for production-grade software, it is quite common practice to do
> > this in research.
> > I really like this approach and I think this is something we should have
> > as a long-term target, I'm also happy to implement given the time but I
> > think one CMake refactor per year is the maximum I can do and that was
> > already eaten up by the dependency detection. Also, I'm unsure about how
> > much this would block us at the moment vs the marketing benefit of having a
> > more modular Arrow; currently I'm leaning on the side that the
> > marketing/adoption benefit would be much larger but we lack someone
> > frustration-tolerant to do the refactoring.
> >
> > Uwe
> >
> > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Lately there seem to be more and more people suggesting that the
> > > optional components in the Arrow C++ project are getting in the way of
> > > using the "core" which implements the columnar format and IPC
> > > protocol. I am not sure I agree with this argument, but in general I
> > > think it would be a good idea to make all optional components in the
> > > project "opt in" rather than "opt out"
> > >
> > > To demonstrate where things currently stand, I created a Dockerfile to
> > > try to make the smallest possible and most dependency-free build
> > >
> > >
> > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > >
> > > Here is the output of this build
> > >
> > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > >
> > > First, let's look at the CMake invocation
> > >
> > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > -DARROW_BOOST_USE_SHARED=OFF \
> > > -DARROW_COMPUTE=OFF \
> > > -DARROW_DATASET=OFF \
> > > -DARROW_JEMALLOC=OFF \
> > > -DARROW_JSON=ON \
> > > -DARROW_USE_GLOG=OFF \
> > > -DARROW_WITH_BZ2=OFF \
> > > -DARROW_WITH_ZLIB=OFF \
> > > -DARROW_WITH_ZSTD=OFF \
> > > -DARROW_WITH_LZ4=OFF \
> > > -DARROW_WITH_SNAPPY=OFF \
> > > -DARROW_WITH_BROTLI=OFF \
> > > -DARROW_BUILD_UTILITIES=OFF
> > >
> > > Aside from the issue of how to obtain and link Boost, here's a couple of
> > things:
> > >
> > > * COMPUTE and DATASET IMHO should be off by default
> > > * All compression libraries should be turned off
> >

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Uwe L. Korn
Hello,

I can think of two other alternatives that make it more visible what Arrow core 
is and what are the optional components:

* Error out when no component is selected instead of building just the core
Arrow. Here we could add an explanatory message that lists all components and,
for each component, 2-3 words on what it does and what it requires. This would
make the first-time experience much better.
* Split the CMake project into several subprojects. By correctly structuring
the CMake files, we should be able to separate out the Arrow components into
separate CMake projects that can be built independently if needed while all
using the same third-party toolchain. We would still have a top-level
CMakeLists.txt that is invoked just like the current one, but through having
subprojects, you would no longer be bound to using the single top-level one.
This would also have some benefit for packagers, who could separate out the
build of individual Arrow modules. Furthermore, it would also make it easier
for PoC/academic projects to just take the Arrow Core sources and drop them in as
a CMake subproject; while this is not a good solution for production-grade
software, it is quite common practice in research.

I really like this approach and I think this is something we should have as a
long-term target. I'm also happy to implement it given the time, but I think one
CMake refactor per year is the maximum I can do, and that was already eaten up
by the dependency detection. Also, I'm unsure how much this would block
us at the moment versus the marketing benefit of having a more modular Arrow;
currently I'm leaning towards the view that the marketing/adoption benefit would be
much larger, but we lack someone frustration-tolerant enough to do the refactoring.

Uwe

On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> hi folks,
> 
> Lately there seem to be more and more people suggesting that the
> optional components in the Arrow C++ project are getting in the way of
> using the "core" which implements the columnar format and IPC
> protocol. I am not sure I agree with this argument, but in general I
> think it would be a good idea to make all optional components in the
> project "opt in" rather than "opt out"
> 
> To demonstrate where things currently stand, I created a Dockerfile to
> try to make the smallest possible and most dependency-free build
> 
> https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> 
> Here is the output of this build
> 
> https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> 
> First, let's look at the CMake invocation
> 
> cmake .. -DBOOST_SOURCE=BUNDLED \
> -DARROW_BOOST_USE_SHARED=OFF \
> -DARROW_COMPUTE=OFF \
> -DARROW_DATASET=OFF \
> -DARROW_JEMALLOC=OFF \
> -DARROW_JSON=ON \
> -DARROW_USE_GLOG=OFF \
> -DARROW_WITH_BZ2=OFF \
> -DARROW_WITH_ZLIB=OFF \
> -DARROW_WITH_ZSTD=OFF \
> -DARROW_WITH_LZ4=OFF \
> -DARROW_WITH_SNAPPY=OFF \
> -DARROW_WITH_BROTLI=OFF \
> -DARROW_BUILD_UTILITIES=OFF
> 
> Aside from the issue of how to obtain and link Boost, here's a couple of 
> things:
> 
> * COMPUTE and DATASET IMHO should be off by default
> * All compression libraries should be turned off
> * GLOG should be off by default
> * Utilities should be off (they are used for integration testing)
> * Jemalloc should probably be off, but we should make it clear that
> opting in will yield better performance
> 
> I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> the build. I opened ARROW-6590 to fix this
> 
> Aside from potentially changing these defaults, there's some things in
> the build that we might want to turn into optional pieces:
> 
> * We should see if we can make boost::filesystem not mandatory in the
> barebones build, if only to satisfy the peanut gallery
> * double-conversion is used in the CSV module. I think that
> double-conversion_ep and the CSV module should both be made opt-in
> * rapidjson_ep should be made optional. JSON support is only needed
> for integration testing
> 
> We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> is not mandatory.
> 
> In general, enabling optional components is primarily relevant for
> packagers. If we implement these changes, a number of package build
> scripts will have to change.
> 
> Thanks,
> Wes
>


Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so / .dll) approach

2019-09-17 Thread Uwe L. Korn
Hello,

I'm actually against this proposal.

My main concern at the moment is that Arrow C++/Python is growing into a really heavy
tool where you always have to bring along all the baggage even when you're only
using a small part of it. This is a problem which makes it harder to use Arrow
in projects because:

* Simply the sheer size: the more dependencies the full build has, the larger
the installable grows.
* Having a large number of dependencies also means that you will need to take
care of security scanning of all of these in production settings. Even when
you're not using those parts, you will need to check for version updates, correct
licenses, and the origin of the dependencies. Having a more modular setup is much
simpler than mastering the art of convincing corporate IT.
* Defining dependencies from third-party libraries gets less transparent. When
a library depends just on a large libarrow.so and fails with a missing symbol
error, a user is confused and might think that the Arrow installation is
corrupt, whereas if the error reports that libarrow_flight.so is missing, they are
much more aware that their local build is one without Flight being built.

I would actually like to see the pyarrow packages split up into several
packages in the future; making the C++ part a single shared object would quite
hinder this. I don't have the resources to move forward with this now, but since I
know I will need it, I intend to implement it at some point.

Uwe

On Tue, Sep 17, 2019, at 6:22 AM, Micah Kornfield wrote:
> I don't have a strong opinion here, but had a question and comment:
> 
> Are there any implications from a project governance perspective of
> packaging Parquet and Arrow into a single shared library?
> 
> As a comment, I'm a big +1 on trying to tease apart the circular
> dependencies between Parquet/Arrow (and any other modules).  As noted
> above, I think this boils down to isolating IO and Buffer data structures
> into 1 library and having the Arrow Array data structures in their own
> separate libraries.
> 
> Thanks,
> Micah
> 
> On Mon, Sep 16, 2019 at 7:35 PM Sutou Kouhei  wrote:
> 
> > Hi,
> >
> > If this is circular, it's a problem. But this isn't circular
> > for now.
> >
> > I think that we can use libarrow as the fundamental shared
> > library to provide common implementations like [1] if we need
> > to provide common implementations for templates. (I think that
> > we don't provide common implementations for templates.)
> >
> > [1]
> > https://github.com/apache/arrow/pull/5221/commits/e88b2579f04451d741eeddcb6697914bcc1019a6
> >
> > Anyway, I'm not strongly opposed to this idea. If we choose
> > the one-shared-library approach, Linux packages, GLib bindings
> > and Ruby bindings can follow the change.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In 
> >   "Re: [DISCUSS][C++] Rethinking our current C++ shared library (.so /
> > .dll) approach" on Thu, 12 Sep 2019 13:23:01 -0500,
> >   Wes McKinney  wrote:
> >
> > > One thing I forgot to mention:
> > >
> > > One of the things driving the creation of new shared libraries is
> > > interdependencies. For example:
> > >
> > > libarrow -> libparquet
> > > libarrow -> libarrow_dataset
> > > libparquet -> libarrow_dataset
> > >
> > > With the modular LLVM-like approach this issue goes away.
> > >
> > > On Thu, Sep 12, 2019 at 1:16 PM Wes McKinney 
> > wrote:
> > >>
> > >> I forgot to add the link to the LLVM library listing
> > >>
> > >> https://gist.github.com/wesm/d13c2844db0c19477e8ee5c95e36a0dc
> > >>
> > >> On Thu, Sep 12, 2019 at 1:14 PM Wes McKinney 
> > wrote:
> > >> >
> > >> > hi folks,
> > >> >
> > >> > I wanted to share some concerns that I have about our current
> > >> > trajectory with regards to producing shared libraries from the Arrow
> > >> > build system.
> > >> >
> > >> > Currently, a comprehensive build produces many shared libraries:
> > >> >
> > >> > * libarrow
> > >> > * libarrow_dataset
> > >> > * libarrow_flight
> > >> > * libarrow_python
> > >> > * libgandiva
> > >> > * libparquet
> > >> > * libplasma
> > >> >
> > >> > There are some others. There are a number of problems with the
> > current approach:
> > >> >
> > >> > * Each DLL needs its own set of "visibility" macros to control the use
> > >> > of __declspec(dllimport/dllexport) on Windows, which is necessary to
> > >> > instruct the import or export of symbols between DLLs on Windows. See
> > >> > e.g.
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/visibility.h
> > >> >
> > >> > * Templates instantiated in one DLL may cause a violation of the One
> > >> > Definition Rule during linking (we lost at least a day of work time
> > >> > collectively to issues around this in ARROW-6244). It is good to be
> > >> > able to share common template interfaces in general
> > >> >
> > >> > * Statically-linked dependencies in one shared lib may need to be
> > >> > statically linked into another library. For example, libgandiva
> > >> > statically links 
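
For readers unfamiliar with the per-DLL "visibility" macros mentioned above, they
generally follow a pattern like this minimal sketch (the names are illustrative,
not Arrow's actual headers):

// visibility.h sketch: every DLL needs its own copy of this boilerplate.
// While compiling the DLL itself, MYLIB_EXPORTING is defined and symbols are
// exported; consumers of the header see dllimport instead.
#if defined(_WIN32)
#  if defined(MYLIB_EXPORTING)
#    define MYLIB_EXPORT __declspec(dllexport)
#  else
#    define MYLIB_EXPORT __declspec(dllimport)
#  endif
#else
#  define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

// Every public class and free function must carry the macro:
class MYLIB_EXPORT SomeClass {
 public:
  void DoSomething();
};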

Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Uwe L. Korn
Hello Krisztián,

> On 05.09.2019 at 14:22, Krisztián Szűcs wrote:
> 
>> * The build configuration is automatically updated on a merge to master?
>> 
> Not yet, but this can be automatized too with buildbot itself.

This is something I would actually like to have before getting rid of the
Travis jobs. Otherwise we would be constrained quite a bit in development when
master CI breaks because of an environment issue, until one of the few people
who can update the config becomes available.

Uwe 


> 
>> 
>> And then a not so simple one: What will happen to our current
>> docker-compose setup? From the PR it seems like we do similar things with
>> ursabot but not using the central docker-compose.yml?
>> 
> Currently we're using docker-compose to run one-off containers rather
> than long running, multi-container services (which docker-compose is
> designed for). Ursabot already supports the features we need from
> docker-compose, so it can effectively replace the docker-compose
> setup as well. We have low-level control over the docker API, so we
> are able to tailor it to our requirements.
> 
>> 
>> 
>> Cheers
>> Uwe
>> 
>>> On 29.08.2019 at 14:19, Krisztián Szűcs <
>> szucs.kriszt...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Arrow's current continuous integration setup utilizes multiple CI
>>> providers,
>>> tools, and scripts:
>>> 
>>> - Unit tests are running on Travis and Appveyor
>>> - Binary packaging builds are running on crossbow, an abstraction over
>>> multiple
>>>  CI providers driven through a GitHub repository
>>> - For local tests and tasks, there is a docker-compose setup, or of
>> course
>>> you
>>>  can maintain your own environment
>>> 
>>> This setup has run into some limitations:
>>> - It’s slow: the CI parallelism of Travis has degraded over the last
>>> couple of
>>>  months. Testing a PR takes more than an hour, which is a long time for
>>> both
>>>  the maintainers and the contributors, and it has a negative effect on
>>> the
>>>  development throughput.
>>> - Build configurations are not portable, they are tied to specific
>>> services.
>>>  You can’t just take a Travis script and run it somewhere else.
>>> - Because they’re not portable, build configurations are duplicated in
>>> several
>>>  places.
>>> - The Travis, Appveyor and crossbow builds are not reproducible locally,
>>> so
>>>  developing them requires the slow git push cycles.
>>> - Public CI has limited platform support, just for example ARM machines
>>> are
>>>  not available.
>>> - Public CI also has limited hardware support, no GPUs are available
>>> 
>>> Resolving all of the issues above is complicated, but is a must for the
>>> long
>>> term sustainability of Arrow.
>>> 
>>> For some time, we’ve been working on a tool called Ursabot[1], a library
>> on
>>> top
>>> of the CI framework Buildbot[2]. Buildbot is well maintained and widely
>>> used
>>> for complex projects, including CPython, Webkit, LLVM, MariaDB, etc.
>>> Buildbot
>>> is not another hosted CI service like Travis or Appveyor: it is an
>>> extensible
>>> framework to implement various automations like continuous integration
>>> tasks.
>>> 
>>> You’ve probably noticed additional “Ursabot” builds appearing on pull
>>> requests,
>>> in addition to the Travis and Appveyor builds. We’ve been testing the
>>> framework
>>> with a fully featured CI server at ci.ursalabs.org. This service runs
>> build
>>> configurations we can’t run on Travis, does it faster than Travis, and
>> has
>>> the
>>> GitHub comment bot integration for ad hoc build triggering.
>>> 
>>> While we’re not prepared to propose moving all CI to a self-hosted setup,
>>> our
>>> work has demonstrated the potential of using buildbot to resolve Arrow’s
>>> continuous integration challenges:
>>> - The docker-based builders are reusing the docker images, which
>> eliminate
>>>  slow dependency installation steps. Some builds on this setup, run on
>>>  Ursa Labs’s infrastructure, run 20 minutes faster than the comparable
>>>  Travis-CI jobs.
>>> - It’s scalable. We can deploy buildbot wherever and add more masters and
>>>  workers, which we can’t do with public CI.
>>> - It’s platform and CI-provider independent. Builds can be run on
>>> arbitrary
>>>  architectures, operating systems, and hardware: Python is the only
>>>  requirement. Additionally builds specified in buildbot/ursabot can be
>>> run
>>>  anywhere: not only on custom buildbot infrastructure but also on
>> Travis,
>>> or
>>>  even on your own machine.
>>> - It improves reproducibility and encourages consolidation of
>>> configuration.
>>>  You can run the exact job locally that ran on Travis, and you can even
>>> get
>>>  an interactive shell in the build so you can debug a test failure. And
>>>  because you can run the same job anywhere, we wouldn’t need to have
>>>  duplicated, Travis-specific or the docker-compose build configuration
>>> stored
>>>  separately.
>>> - It’s extensible. More exotic features like a comment bot, 

Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Uwe L. Korn
Hello Krisztián, 

I like this proposal. CI coverage and response time are crucial for the
health of the project. In general I like the consolidation and local
reproducibility of the builds. Some questions I wanted to ask to make sure I
understand your proposal correctly (hopefully they can all be answered with a
simple yes):

* Windows builds will stay in Appveyor for now?
* MacOS builds will stay in Travis?
* All other builds will be removed from Travis?
* Machines are currently run and funded by UrsaLabs but others could also 
sponsor an instance that could be added to the setup?
* The build configuration is automatically updated on a merge to master?

And then a not so simple one: What will happen to our current docker-compose 
setup? From the PR it seems like we do similar things with ursabot but not 
using the central docker-compose.yml?


Cheers
Uwe

> On 29.08.2019 at 14:19, Krisztián Szűcs wrote:
> 
> Hi,
> 
> Arrow's current continuous integration setup utilizes multiple CI
> providers,
> tools, and scripts:
> 
> - Unit tests are running on Travis and Appveyor
> - Binary packaging builds are running on crossbow, an abstraction over
> multiple
>   CI providers driven through a GitHub repository
> - For local tests and tasks, there is a docker-compose setup, or of course
> you
>   can maintain your own environment
> 
> This setup has run into some limitations:
> - It’s slow: the CI parallelism of Travis has degraded over the last
> couple of
>   months. Testing a PR takes more than an hour, which is a long time for
> both
>   the maintainers and the contributors, and it has a negative effect on
> the
>   development throughput.
> - Build configurations are not portable, they are tied to specific
> services.
>   You can’t just take a Travis script and run it somewhere else.
> - Because they’re not portable, build configurations are duplicated in
> several
>   places.
> - The Travis, Appveyor and crossbow builds are not reproducible locally,
> so
>   developing them requires the slow git push cycles.
> - Public CI has limited platform support, just for example ARM machines
> are
>   not available.
> - Public CI also has limited hardware support, no GPUs are available
> 
> Resolving all of the issues above is complicated, but is a must for the
> long
> term sustainability of Arrow.
> 
> For some time, we’ve been working on a tool called Ursabot[1], a library on
> top
> of the CI framework Buildbot[2]. Buildbot is well maintained and widely
> used
> for complex projects, including CPython, Webkit, LLVM, MariaDB, etc.
> Buildbot
> is not another hosted CI service like Travis or Appveyor: it is an
> extensible
> framework to implement various automations like continuous integration
> tasks.
> 
> You’ve probably noticed additional “Ursabot” builds appearing on pull
> requests,
> in addition to the Travis and Appveyor builds. We’ve been testing the
> framework
> with a fully featured CI server at ci.ursalabs.org. This service runs build
> configurations we can’t run on Travis, does it faster than Travis, and has
> the
> GitHub comment bot integration for ad hoc build triggering.
> 
> While we’re not prepared to propose moving all CI to a self-hosted setup,
> our
> work has demonstrated the potential of using buildbot to resolve Arrow’s
> continuous integration challenges:
> - The docker-based builders are reusing the docker images, which eliminate
>   slow dependency installation steps. Some builds on this setup, run on
>   Ursa Labs’s infrastructure, run 20 minutes faster than the comparable
>   Travis-CI jobs.
> - It’s scalable. We can deploy buildbot wherever and add more masters and
>   workers, which we can’t do with public CI.
> - It’s platform and CI-provider independent. Builds can be run on
> arbitrary
>   architectures, operating systems, and hardware: Python is the only
>   requirement. Additionally builds specified in buildbot/ursabot can be
> run
>   anywhere: not only on custom buildbot infrastructure but also on Travis,
> or
>   even on your own machine.
> - It improves reproducibility and encourages consolidation of
> configuration.
>   You can run the exact job locally that ran on Travis, and you can even
> get
>   an interactive shell in the build so you can debug a test failure. And
>   because you can run the same job anywhere, we wouldn’t need to have
>   duplicated, Travis-specific or the docker-compose build configuration
> stored
>   separately.
> - It’s extensible. More exotic features like a comment bot, benchmark
>   database, benchmark dashboard, artifact store, integrating other systems
> are
>   easily implementable within the same system.
> 
> I’m proposing to donate the build configuration we’ve been iterating on in
> Ursabot to the Arrow codebase. Here [3] is a patch that adds the
> configuration.
> This will enable us to explore consolidating build configuration using the
> buildbot framework. A next step after to explore that would be to port a
> Travis
> 

Re: Parquet to Arrow in Java

2019-09-04 Thread Uwe L. Korn
Hello,

You may want to interact with the Apache Iceberg community here. They are
currently working on a similar thing:
https://lists.apache.org/thread.html/3bb4f89a0b37f474cf67915f91326fa845afa597bdd2463c98a2c8b9@%3Cdev.iceberg.apache.org%3E
I'm not involved in this, just reading both mailing lists and thought I'd
share it.

Cheers
Uwe

On Wed, Sep 4, 2019, at 7:24 PM, Chao Sun wrote:
> Bumping this.
> 
> We may have an upcoming use case for this as well. Want to know if anyone
> is actively working on this? I also heard that Dremio has internally
> implemented a performant Parquet to Arrow reader. Is there any plan to open
> source it? That could save us a lot of work.
> 
> Thanks,
> Chao
> 
> On Fri, Aug 9, 2019 at 8:49 AM Renjie Liu  wrote:
> 
> > Hi:
> >
> > I'm working on the Rust part and expecting to finish it soon. I'm
> > also interested in the Java version because we are trying to embed Arrow in
> > Spark to implement vectorized processing. Maybe we can work together.
> >
> > Micah Kornfield  于 2019年8月5日周一 下午1:50写道:
> >
> > > Hi Anoop,
> > > I think a contribution would be welcome.  There was a recent discussion
> > > thread on what would be expected from new "readers" for Arrow data in
> > Java
> > > [1].  I think its worth reading through but my recollections of the
> > > highlights are:
> > > 1.  A short design sketch in the JIRA that will track the work.
> > > 2.  Off-heap data-structures as much as possible
> > > 3.  An interface that allows predicate push down, column projection and
> > > specifying the batch sizes of reads.  I think there is probably some
> > > interplay here between RowGroup size and size of batches.  It might be worth
> > > thinking about this up front and mentioning it in the design.
> > > 4.  Performant (since we are going from columnar->columnar it should be
> > > faster than Parquet-MR and on par with or better than Spark's implementation,
> > > which I believe also goes from columnar to columnar).
> > >
> > > Answers to specific questions below.
> > >
> > > Thanks,
> > > Micah
> > >
> > > To help me get started, are there any pointers on how the C++ or Rust
> > > > implementations currently read Parquet into Arrow?
> > >
> > > I'm not sure about the Rust code, but the C++ code is located at [2], it
> > is
> > > has been going under some recent refactoring (and I think Wes might have
> > 1
> > > or 2 changes till to make).  It doesn't yet support nested data types
> > fully
> > > (e.g. structs).
> > >
> > > Are they reading Parquet row-by-row and building Arrow batches or are
> > there
> > > > better ways of implementing this?
> > >
> > > I believe the implementations should be reading a row-group at a time
> > > column by column.  Spark potentially has an implementation that already
> > > does this.
> > >
> > >
> > > [1]
> > >
> > >
> > https://lists.apache.org/thread.html/b096528600e66c17af9498c151352f12944ead2fd218a0257fdd4f70@%3Cdev.arrow.apache.org%3E
> > > [2]
> > https://github.com/apache/arrow/tree/master/cpp/src/parquet/arrow
> > >
> > > On Sun, Aug 4, 2019 at 2:52 PM Anoop Johnson 
> > > wrote:
> > >
> > > > Thanks for the response Micah. I could implement this and contribute to
> > > > Arrow Java. To help me get started, are there any pointers on how the
> > C++
> > > > or Rust implementations currently read Parquet into Arrow? Are they
> > > reading
> > > > Parquet row-by-row and building Arrow batches or are there better ways
> > of
> > > > implementing this?
> > > >
> > > > On Tue, Jul 30, 2019 at 1:56 PM Micah Kornfield  > >
> > > > wrote:
> > > >
> > > >> Hi Anoop,
> > > >> There isn't currently anything in the Arrow Java library that does
> > this.
> > > >> It is something that I think we want to add at some point.   Dremio
> > [1]
> > > >> has
> > > >> some Parquet related code, but I haven't looked at it to understand
> > how
> > > >> easy it is to use as a standalone library and whether is supports
> > > >> predicate
> > > >> push-down/column selection.
> > > >>
> > > >> Thanks,
> > > >> Micah
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> > https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> > > >>
> > > >> On Sun, Jul 28, 2019 at 2:08 PM Anoop Johnson <
> > > anoop.k.john...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Arrow Newbie here.  What is the recommended way to convert Parquet
> > > data
> > > >> > into Arrow, preferably doing 
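
For the question above about how the C++ implementation reads Parquet into
Arrow, a minimal sketch of reading one row group at a time with the
parquet::arrow reader looks roughly like the following (exact signatures vary
between Arrow versions and error handling is abbreviated):

#include <arrow/io/file.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

// Sketch only: read a Parquet file into Arrow tables one row group at a
// time, column by column, rather than row by row.
void ReadParquetByRowGroup(const std::string& path) {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_THROW_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  for (int i = 0; i < reader->num_row_groups(); ++i) {
    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadRowGroup(i, &table));  // columnar -> columnar
    // ... hand the row group's table to downstream processing ...
  }
}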

Re: Trouble building on Mac OS Mojave

2019-08-31 Thread Uwe L. Korn
Hello Chris,

as a contributor, it is often simpler to use conda to construct a local
development environment as outlined in
https://arrow.apache.org/docs/developers/python.html#using-conda
This is the typical environment most contributors work in. Even when not using
conda as a package/environment manager elsewhere, I would recommend using it
to set up your Arrow build environment, as this is what most developers do.
Thus it will be easier to help you, and this is the setup we (try to) maintain
best.

Cheers
Uwe

On Sat, Aug 31, 2019, at 3:48 PM, Chris Teoh wrote:
> Does this approach fit with potentially a contributor's workflow? I was
> looking into contributing though I'm unsure if I am doing it right.
> 
> On Sat, 31 Aug 2019 at 22:22, Jeroen Ooms  wrote:
> 
> > On Sat, Aug 31, 2019 at 4:48 AM Chris Teoh  wrote:
> > >
> > > That being said, is there an easier way by using a Docker container I
> > could
> > > use to build this in?
> >
> > An easy way to install arrow on MacOS is using homebrew. To get a
> > precompiled version of the latest release:
> >
> >   brew install apache-arrow
> >
> > Or to build the master branch from source:
> >
> >brew install apache-arrow --HEAD
> >
> > If you want to customize the configuration use "brew edit
> > apache-arrow" before building from source.
> >
> 
> 
> -- 
> Chris
>


Re: [VOTE] Adopt FORMAT and LIBRARY SemVer-based version schemes for Arrow 1.0.0 and beyond

2019-07-31 Thread Uwe L. Korn
+1 from me.

I really like the separate versions

Uwe

On Tue, Jul 30, 2019, at 2:21 PM, Antoine Pitrou wrote:
> 
> +1 from me.
> 
> Regards
> 
> Antoine.
> 
> 
> 
> On Fri, 26 Jul 2019 14:33:30 -0500
> Wes McKinney  wrote:
> > hello,
> > 
> > As discussed on the mailing list thread [1], Micah Kornfield has
> > proposed a version scheme for the project to take effect starting with
> > the 1.0.0 release. See document [2] containing a discussion of the
> > issues involved.
> > 
> > To summarize my understanding of the plan:
> > 
> > 1. TWO VERSIONS: As of 1.0.0, we establish separate FORMAT and LIBRARY
> > versions. Currently there is only a single version number.
> > 
> > 2. SEMANTIC VERSIONING: We follow https://semver.org/ with regards to
> > communicating library API changes. Given the project's pace of
> > evolution, most releases are likely to be MAJOR releases according to
> > SemVer principles.
> > 
> > 3. RELEASES: Releases of the project will be named according to the
> > LIBRARY version. A major release may or may not change the FORMAT
> > version. When a LIBRARY version has been released for a new FORMAT
> > version, the latter is considered to be released and official.
> > 
> > 4. Each LIBRARY version will have a corresponding FORMAT version. For
> > example, LIBRARY versions 2.0.0 and 3.0.0 may track FORMAT version
> > 1.0.0. The idea is that FORMAT version will change less often than
> > LIBRARY version.
> > 
> > 5. BACKWARD COMPATIBILITY GUARANTEE: A newer versioned client library
> > will be able to read any data and metadata produced by an older client
> > library.
> > 
> > 6. FORWARD COMPATIBILITY GUARANTEE: An older client library must be
> > able to either read data generated from a new client library or detect
> > that it cannot properly read the data.
> > 
> > 7. FORMAT MINOR VERSIONS: An increase in the minor version of the
> > FORMAT version, such as 1.0.0 to 1.1.0, indicates that 1.1.0 contains
> > new features not available in 1.0.0. So long as these features are not
> > used (such as a new logical data type), forward compatibility is
> > preserved.
> > 
> > 8. FORMAT MAJOR VERSIONS: A change in the FORMAT major version
> > indicates a disruption to these compatibility guarantees in some way.
> > Hopefully we don't have to do this many times in our respective
> > lifetimes
> > 
> > If I've misrepresented some aspect of the proposal it's fine to
> > discuss more and we can start a new vote.
> > 
> > Please vote to approve this proposal. I'd like to keep this vote open
> > for 7 days (until Friday August 2) to allow for ample opportunities
> > for the community to have a look.
> > 
> > [ ] +1 Adopt these version conventions and compatibility guarantees as
> > of Apache Arrow 1.0.0
> > [ ] +0
> > [ ] -1 I disagree because...
> > 
> > Here is my vote: +1
> > 
> > Thanks
> > Wes
> > 
> > [1]: 
> > https://lists.apache.org/thread.html/5715a4d402c835d22d929a8069c5c0cf232077a660ee98639d544af8@%3Cdev.arrow.apache.org%3E
> > [2]: 
> > https://docs.google.com/document/d/1uBitWu57rDu85tNHn0NwstAbrlYqor9dPFg_7QaE-nc/edit#
> > 
> 
> 
> 
>


Re: [C++] Private implementations and virtual interfaces

2019-07-28 Thread Uwe L. Korn
Building CI that detects ABI breakage is not hard. There is a closed PR of mine in
the arrow repo that does exactly this using abi-compliance-checker.

I understand that we will not be able to provide ABI stability for all Arrow
subprojects, but having it for a core would be really great. This would allow
easy upgrading of adapters that connect to Arrow but don't actually work with
the resulting structures (these can be database connectors and file format
implementations).

Uwe

On Sun, Jul 28, 2019, at 10:43 AM, Antoine Pitrou wrote:
> 
> On 28/07/2019 at 01:49, Wes McKinney wrote:
> > On Sat, Jul 27, 2019 at 4:38 PM Uwe L. Korn  wrote:
> >>
> >> The PIMPL is a thing I would trade a bit of performance for, as it brings ABI
> >> stability. This is something that will help us make Arrow usage in
> >> third-party code much simpler.
> >>
> > 
> > I question whether ABI stability (at the level of the shared library
> > ABI version) will ever be practical in this project. In the case of
> > NumPy, there are very few C data structures that could actually
> > change. In this library we have a great deal more data structures, I
> > would guess 100s of distinct objects taking into account each C++
> > class in the library. It's one thing to be API forward compatible from
> > major version to major version (with warnings to allow for graceful
> > deprecations) but such forward compatibility strategies are not so
> > readily available when talking about the composition of C++ classes
> > (considering that virtual function tables are part of the ABI).
> > 
> > In any case, as a result of this, I'm not comfortable basing technical
> > decisions on the basis of ABI stability considerations.
> 
> We could at some point define a "stable ABI subset" (for example the
> core classes Array, ArrayData, etc.).  But I'm not sure whether that
> would be sufficient for users.
> 
> Also in any case it would need someone to maintain the regular toolkit
> and continuous integration hooks to prevent ABI breakage.  Otherwise
> it's pointless promising something that we can't efficiently monitor.
> 
> Regards
> 
> Antoine.
>


Re: [C++] Private implementations and virtual interfaces

2019-07-27 Thread Uwe L. Korn
The PIMPL is a thing I would trade a bit of performance for, as it brings ABI
stability. This is something that will help us make Arrow usage in third-party
code much simpler.

Simple updates when an API was only extended but the ABI stays intact are a great
relief on the Arrow consumer side. I know that this is a bit more hassle on the
developer side, but it's something I really love about NumPy. It's so much
simpler to do a version upgrade there than with an ABI-breaking library such as Arrow.

Uwe
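
A minimal sketch of the PIMPL pattern under discussion, with illustrative names
rather than Arrow code: the public header exposes only non-virtual forwarding
methods and an opaque pointer, so the implementation can grow without changing
the object layout or vtable offsets that client code compiled against.

#include <memory>

// --- public header (what consumers compile against) ---
class Reader {
 public:
  Reader();
  ~Reader();
  int ReadNext();           // non-virtual forwarding method: the call site does
                            // not index into a vtable, so adding methods later
                            // does not shift any offsets baked into client code
 private:
  class Impl;               // opaque; its layout can change without ABI impact
  std::unique_ptr<Impl> impl_;
};

// --- implementation file (inside the library) ---
class Reader::Impl {
 public:
  int ReadNext() { return 42; }  // the real logic lives here
};

Reader::Reader() : impl_(new Impl()) {}
Reader::~Reader() = default;
int Reader::ReadNext() { return impl_->ReadNext(); }

The trade-off discussed in this thread: the forwarding methods and export macros
are extra boilerplate, but the vtable/offset knowledge stays inside the library,
so extending the API does not force clients to recompile.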

> On 27.07.2019 at 22:57, Jed Brown wrote:
> 
> Wes McKinney  writes:
> 
>> The abstract/all-virtual base has some benefits:
>> 
>> * No need to implement "forwarding" methods to the private implementation
>> * Do not have to declare "friend" classes in the header for some cases
>> where other classes need to access the methods of a private
>> implementation
>> * Implementation symbols do not need to be exported in DLLs with an
>> *_EXPORT macro
>> 
>> There are some drawbacks, or cases where this method cannot be applied, 
>> though:
>> 
>> * An implementation of some other abstract interface which needs to
>> appear in a public header may not be able to use this approach.
>> * My understanding is that the PIMPL pattern will perform better for
>> non-virtual functions that are called a lot. It'd be helpful to know
>> the magnitude of the performance difference
>> * Complex inheritance patterns may require use of virtual inheritance,
>> which can create a burden for downstream users (e.g. they may have to
>> use dynamic_cast to convert between types in the class hierarchy)
> 
> I would add these two points, which may or may not be a significant
> concern to you:
> 
> * When you add new methods to the abstract virtual model, you change the
>  ABI [1] and therefore need to recompile client code.  This has many
>  consequences for distribution.
> 
> * PIMPL gives a well-defined place for input validation and setting
>  debugger breakpoints even when you don't know which implementation
>  will be used.
> 
> 
> [1] The ABI changes because code to index into the vtable is inlined at
> the call site.  Adding to your example
> 
>  void foo(VirtualType &obj) {
>    obj.Method1();
>    // any other line to suppress tail call
>  }
> 
> produces assembly like
> 
>  mov    rax,QWORD PTR [rdi]   ; load vtable for obj
>  call   QWORD PTR [rax+0x10]  ; 0x10 is offset into vtable
> 
> A different method will use a different offset.  If you add a method,
> offsets of existing methods may change.  With PIMPL, the equivalent
> indexing code resides in your library instead of client code, and yields
> a static (or PLT) call resolved by the linker:
> 
>  call   _ZN9PIMPLType7Method1Ev@PLT



Re: [Discuss] Do a 0.15.0 release before 1.0.0?

2019-07-23 Thread Uwe L. Korn
It is also a good way to test the change in public. We don't want to adjust
something like this anymore in a 1.0.0 release. Already doing it in 0.15.0
and then maybe making adjustments due to issues that appear "in the wild" is
psychologically the easier way. A lot of user expectations are bound up with
the magic 1.0, thus I would plan to minimize what changes between pre-1.0 and
1.0. This should also save us maintainers some time, as I would expect
different behaviour in bug reports between 1.0 and pre-1.0 issues.

Uwe

On Tue, Jul 23, 2019, at 7:52 AM, Micah Kornfield wrote:
> I think the main reason to do a release before 1.0.0 is if we want to make
> the change that would give a good error message for forward incompatibility
> (I think this could be done as 0.14.2 since it would just be clarifying an
> error message).  Otherwise, I think including it in 1.0.0 would be fine
> (its still not clear to me if there is consensus to fix the issue).
> 
> Thanks,
> Micah
> 
> 
> On Monday, July 22, 2019, Wes McKinney  wrote:
> 
> > I'd be satisfied with fixing the Flatbuffer alignment issue either in
> > a 0.15.0 or 1.0.0. In the interest of expediency, though, making a
> > 0.15.0 with this change sooner rather than later might be prudent.
> >
> > On Mon, Jul 22, 2019 at 12:35 PM Antoine Pitrou 
> > wrote:
> > >
> > >
> > > Hello,
> > >
> > > Recently we've discussed breaking the IPC format to fix a long-standing
> > > alignment issue.  See this discussion:
> > >
> > https://lists.apache.org/thread.html/8cea56f2069710ac128ff9129c744f0ef96a3e33a4d79d7e820019af@%3Cdev.arrow.apache.org%3E
> > >
> > > Should we first do a 0.15.0 in order to get those format fixes right?
> > > Once that is fine and settled we can move to the 1.0.0 release?
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: Caution about CI builds on personal forks

2019-07-17 Thread Uwe L. Korn
Docker works well for all people on all OSes. 

The interesting cases will be Windows, OSX, or aarch64 builds, which require a
special system.

Uwe

On Wed, Jul 17, 2019, at 6:11 PM, Antoine Pitrou wrote:
> 
> I'm not sure how Docker will work for people not on Linux though?
> (and/or for macOS builds)
> 
> Regards
> 
> Antoine.
> 
> 
> On Wed, 17 Jul 2019 10:54:13 -0500
> Wes McKinney  wrote:
> 
> > Presumably a migration away from Travis also means that we have to
> > develop tools to allow contributors to test their patches outside of
> > the GitHub pull request. If something is Docker-based, then it can be
> > run locally, of course.
> > 
> > We definitely can't persist under the current circumstances where
> > builds take hours to even begin. Here's an example of a PR that was
> > approved 3 hours ago but whose Travis builds only started about 10
> > minutes ago (and will have to run for at least another 30-60 minutes)
> > 
> > https://github.com/apache/arrow/pull/4894
> > 
> > I think we need to get to an SLA where we're getting feedback on PRs
> > in 90 minutes or less.
> > 
> > On Wed, Jul 17, 2019 at 10:47 AM Neal Richardson
> >  wrote:
> > >
> > > Won't moving CI away from Travis to our own infrastructure mean that we
> > > won't get any CI on our personal forks?
> > >
> > > On Wed, Jul 17, 2019 at 8:23 AM Wes McKinney  wrote:
> > >  
> > > > On Wed, Jul 17, 2019 at 10:22 AM Wes McKinney  
> > > > wrote:  
> > > > >
> > > > > hi folks -- I noticed this last night on
> > > > > https://github.com/apache/arrow/pull/4841 and it surprised me. Others
> > > > > may not be aware.
> > > > >
> > > > > We have been using builds on Appveyor and Travis CI to decide whether
> > > > > to merge PRs. The trouble is these builds are not equivalent to the
> > > > > builds that Travis runs inside the PR (using the apache/arrow build
> > > > > queue). The differences are:
> > > > >  
> > > >
> > > > *missing crucial detail: "builds on personal forks"
> > > >  
> > > > > * They do not take into account changes in master (IOW to test if the
> > > > > build works after `git merge`)
> > > > > * They only test the latest commit versus the previous one in the 
> > > > > branch
> > > > >
> > > > > This latter item is insidious, because of the `detect-changes.py`
> > > > > script. Suppose that you have a large PR that touches many components,
> > > > > and you push a commit that only affects one of them. Then the
> > > > > detect-changes.py script will cause Travis to only run builds for the
> > > > > affected component in the most recent commit.
> > > > >
> > > > > Here's an example of such a spurious build
> > > > >
> > > > > https://travis-ci.org/wesm/arrow/builds/559745190
> > > > >
> > > > > There are a few ways we can mitigate this last issue:
> > > > >
> > > > > * If you need a faster build, you can squash your commits and rebase
> > > > > on master before pushing to make sure that Travis "sees" everything.
> > > > > Note this still carries risk of conflicting changes in master causing
> > > > > a broken build post-merge
> > > > >
> > > > > * We can change the Travis configuration to try to detect whether or
> > > > > not we are testing a PR -- the detect-changes.py logic is really only
> > > > > intended to speed up builds in apache/arrow
> > > > >
> > > > > Overall, I think we need to accelerate our exodus from Travis CI since
> > > > > it's hurting the project's productivity to be waiting so long on
> > > > > builds. We've moved a couple of jobs to be Docker-based but we have
> > > > > quite a lot more work to do to decouple ourselves.
> > > > >
> > > > > Thanks
> > > > > Wes  
> > > >  
> > 
> 
> 
> 
>


Re: Sharing Java Arrow Buffer with C++ in same process

2019-07-17 Thread Uwe L. Korn
Hello Hans,

we sadly have no code for the C++<->Java interaction, but a good example is the
Python<->Java interaction code in
https://github.com/apache/arrow/blob/master/python/pyarrow/jvm.py . This calls
Java from Python using the jpype1 module and then uses the memory pointers in
the Java objects to construct pyarrow objects out of them.

Cheers
Uwe
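
On the C++ side of such an in-process integration, the buffer address and length
obtained from the Java side can be wrapped in a non-owning arrow::Buffer; a rough
sketch, assuming the JNI plumbing is done by the application and that the Java
allocator keeps the memory alive for as long as C++ uses it:

#include <cstdint>
#include <memory>

#include <arrow/buffer.h>

// Sketch: wrap memory owned by the Java side (address and size passed over
// JNI) in a non-owning arrow::Buffer. The Java allocator must keep this
// memory alive while the C++ side holds the Buffer.
std::shared_ptr<arrow::Buffer> WrapJavaMemory(int64_t address, int64_t size) {
  const auto* data =
      reinterpret_cast<const uint8_t*>(static_cast<uintptr_t>(address));
  return std::make_shared<arrow::Buffer>(data, size);
}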

On Wed, Jul 17, 2019, at 2:38 PM, hans-joachim.bo...@web.de wrote:
> In my application I need to share Arrow buffers allocated in Java with 
> C++ in the same process.
> Is there already some code in Arrow to pass the native address from 
> Java to C++ or do I have to do my own JNI call?
> I do not want to go via the Plasma sockets and did not find any hint in 
> docs and Jira.
> 
> Can anybody point me to the right place or confirm that this still needs to be done?
> 
> Thanks,
> Hans.
>


[jira] [Created] (ARROW-5919) [R] Add nightly tests for building r-arrow with dependencies from conda-forge

2019-07-12 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5919:
--

 Summary: [R] Add nightly tests for building r-arrow with 
dependencies from conda-forge
 Key: ARROW-5919
 URL: https://issues.apache.org/jira/browse/ARROW-5919
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Python] Wheel questions

2019-07-12 Thread Uwe L. Korn
Hello,

On Thu, Jul 11, 2019, at 9:51 PM, Wes McKinney wrote:
> On Thu, Jul 11, 2019 at 11:26 AM Antoine Pitrou  wrote:
> >
> >
> > On 11/07/2019 at 17:52, Krisztián Szűcs wrote:
> > > Hi All,
> > >
> > > I have a couple of questions about the wheel packaging:
> > > - why do we build an arrow namespaced boost on linux and osx, could we 
> > > link
> > > statically like with the windows wheels?
> >
> > No idea.  Boost shouldn't leak in the public APIs, so theoretically a
> > static build would be fine...

Static linkage is fine as long as we don't expose any Boost symbols. We had
that historically in the Decimals. If this is gone, we can switch to static
linkage.

> > > - do we explicitly say somewhere in the linux wheels to link the 3rdparty
> > > dependencies statically or just implicitly, by removing (or not building)
> > > the shared libs for the 3rdparty dependencies?
> >
> > It's implicit by removing the shared libs (or not building them).
> > Some time ago the compression libs were always linked statically by
> > default, but it was changed to dynamic along the time, probably to
> > please system packagers.
> 
> I think only libz shared library is being bundled, for security reasons

Ah, yes. This was why we switched to dynamic linkage! Can you add a comment the
next time you touch the build scripts?

> > > - couldn't we use the 3rdparty toolchain to build the smaller 3rdparty
> > > dependencies for the linux wheels instead of building them manually in the
> > > manylinux docker image - it'd easier to say _SOURCE=BUNDLED
> >
> > I don't think so.  The conda-forge and Anaconda packages use a different
> > build chain (different compiler, different libstdc++ version) and may
> > not be usable directly on manylinux-compliant systems.
> 
> I think you may misunderstand. Krisztian is suggesting building the
> dependencies through the ExternalProject mechanism during "docker run"
> on the image rather than caching pre-built versions in the Docker
> image.
> 
> For small dependencies, I don't see why we couldn't used the BUNDLED
> approach. This might spare us having to maintain some of the build
> scripts. It will strictly increase build times, though -- I think the
> reason that everything is cached now is to save on build times (which
> have historically been quite long)

Actually, the most pragmatic way I have thought of yet would be to use conda and
build all our dependencies. Instead of using the compilers that defaults and
conda-forge use, we should build the dependencies in the manylinux image
and then upload them to a custom channel. This should also make the maintenance
of the arrow-manylinux docker container easier, as it won't require you to
do a full recompile of LLVM just because you changed something in a preceding
step.

Uwe


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Uwe L. Korn
Hello Liya Fan,

here your best approach is to copy into the Arrow format, as you can then use
this as the basis for working with the Arrow-native representation as well as
your internal representation. You will have to use two different offset vectors,
as those two will always differ; but in the case of your internal
representation, you don't have the requirement of consecutive data that Arrow has,
and you can still work with the strings just as before even when they are stored
consecutively.

Uwe
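
To make the copy concrete, a rough sketch of such a conversion with the C++
builder API; the (pointer, length) segment representation is illustrative, and
the point is that the scattered segments end up in Arrow's single contiguous
values buffer plus an offsets buffer:

#include <memory>
#include <vector>

#include <arrow/api.h>

// Illustrative internal representation: each string is a (pointer, length)
// pair referring into arbitrary, non-consecutive memory segments.
struct StringRef {
  const char* data;
  int32_t length;
};

// Sketch: copy the scattered segments into the Arrow layout. StringBuilder
// appends every value into one contiguous values buffer and records the
// running end positions in the offsets buffer.
arrow::Status ToArrow(const std::vector<StringRef>& values,
                      std::shared_ptr<arrow::Array>* out) {
  arrow::StringBuilder builder;
  for (const auto& v : values) {
    ARROW_RETURN_NOT_OK(builder.Append(v.data, v.length));
  }
  return builder.Finish(out);
}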

On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> Hi Korn,
> 
> Thanks a lot for your comments.
> 
> In my opinion, your comments make sense to me. Allowing non-consecutive
> memory segments will break some good design choices of Arrow.
> However, there are wide-spread user requirements for non-consecutive memory
> segments. I am wondering how can we help such users. What advice we can
> give to them?
> 
> Memory copy/move can be a solution, but is there a better solution?
> Is there a third alternative? Can we virtualize the non-consecutive memory
> segments into a consecutive one? (Although performance overhead is
> unavoidable.)
> 
> What do you think? Let's brain-storm it.
> 
> Best,
> Liya Fan
> 
> 
> On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn  wrote:
> 
> > Hello Liya,
> >
> > I'm quite -1 on this type as Arrow is about efficient columnar structures.
> > We have opened the standard also to matrix-like types but always keep the
> > constraint of consecutive memory. Now also adding types where memory is no
> > longer consecutive but spread in the heap will make the scope of the
> > project much wider (It seems that we then just turn into a general
> > serialization framework).
> >
> > One of the ideas of a common standard is that some need to make
> > compromises. I think in this case it is a necessary compromise to not allow
> > all kind of string representations.
> >
> > Uwe
> >
> > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > Hi all,
> > >
> > >
> > > We are thinking of providing varchar/varbinary vectors with a different
> > > memory layout which exists in a wide range of systems. The memory layout
> > is
> > > different from that of VarCharVector in the following ways:
> > >
> > >
> > >1.
> > >
> > >Instead of storing (start offset, end offset), the new layout stores
> > >(start offset, length)
> > >2.
> > >
> > >The content of varchars may not be in a consecutive memory region.
> > >Instead, it can be in arbitrary memory address.
> > >
> > >
> > > Due to these differences in memory layout, it incurs performance overhead
> > > when converting data between existing systems and VarCharVectors.
> > >
> > > The above difference 1 seems insignificant, while difference 2 is
> > difficult
> > > to overcome. However, the scenario of difference 2 is prevalent in
> > > practice: for example we store strings in a series of memory segments.
> > > Whenever a segment is full, we request a new one. However, these memory
> > > segments may not be consecutive, because other processes/threads are also
> > > requesting/releasing memory segments in the meantime.
> > >
> > > So we are wondering if it is possible to support such memory layout in
> > > Arrow. I think there are more systems that are trying to adopting Arrow,
> > > but are hindered by such difficulty.
> > >
> > > Would you please give your valuable feedback?
> > >
> > >
> > > Best,
> > >
> > > Liya Fan
> > >
> >
>


Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

2019-07-11 Thread Uwe L. Korn
Hello Liya,

I'm quite -1 on this type, as Arrow is about efficient columnar structures. We
have also opened the standard to matrix-like types, but always keep the
constraint of consecutive memory. Adding types where memory is no
longer consecutive but spread across the heap would make the scope of the project
much wider (it seems that we would then just turn into a general serialization
framework).

One of the ideas of a common standard is that some need to make compromises. I
think in this case it is a necessary compromise to not allow all kinds of string
representations.

Uwe

On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> Hi all,
> 
> 
> We are thinking of providing varchar/varbinary vectors with a different
> memory layout which exists in a wide range of systems. The memory layout is
> different from that of VarCharVector in the following ways:
> 
> 
>1.
> 
>Instead of storing (start offset, end offset), the new layout stores
>(start offset, length)
>2.
> 
>The content of varchars may not be in a consecutive memory region.
>Instead, it can be at an arbitrary memory address.
> 
> 
> Due to these differences in memory layout, it incurs performance overhead
> when converting data between existing systems and VarCharVectors.
> 
> The above difference 1 seems insignificant, while difference 2 is difficult
> to overcome. However, the scenario of difference 2 is prevalent in
> practice: for example we store strings in a series of memory segments.
> Whenever a segment is full, we request a new one. However, these memory
> segments may not be consecutive, because other processes/threads are also
> requesting/releasing memory segments in the meantime.
> 
> So we are wondering if it is possible to support such memory layout in
> Arrow. I think there are more systems that are trying to adopt Arrow,
> but are hindered by such difficulty.
> 
> Would you please give your valuable feedback?
> 
> 
> Best,
> 
> Liya Fan
>


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Uwe L. Korn
This sounds fine to me, thus I'm +1 on removing this class.
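
As a sketch of what Column-free construction could look like on the C++ side
(the Table::Make overload taking ChunkedArrays is an assumption about the
post-removal API, not a settled design):

#include <memory>

#include <arrow/api.h>

// Sketch: building a Table directly from a schema plus ChunkedArrays, without
// intermediate Column objects. The schema stays the single point of truth for
// field names/types; the ChunkedArray types would be validated against it,
// as with RecordBatch (see Wes's note below).
arrow::Status MakeExampleTable(std::shared_ptr<arrow::Table>* out) {
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});

  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 3}));
  std::shared_ptr<arrow::Array> chunk;
  ARROW_RETURN_NOT_OK(builder.Finish(&chunk));

  auto data = std::make_shared<arrow::ChunkedArray>(arrow::ArrayVector{chunk});
  *out = arrow::Table::Make(schema, {data});  // field info comes from the schema
  return arrow::Status::OK();
}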

On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> Yes, the schema would be the point of truth for the Field. The ChunkedArray
> type would have to be validated against the schema types as with RecordBatch
> 
> On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn  wrote:
> 
> > Hello Wes,
> >
> > where do you intend the Field object to live then? Would this be part of
> > the schema of the Table object?
> >
> > Uwe
> >
> > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > For some time now I have been uncertain about the utility provided by
> > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > things:
> > >
> > > * An arrow::Field object (name and data type)
> > > * An arrow::ChunkedArray object for the data
> > >
> > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > for the arrow::Table class which represents a collection of
> > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > Sometimes a Table will have mostly columns with a single chunk while
> > > some columns will have many chunks.
> > >
> > > I'm concerned about continuing to maintain the Column class as it's
> > > spilling complexity into computational libraries and bindings alike.
> > >
> > > The Python Column class for example mostly forwards method calls to
> > > the underlying ChunkedArray
> > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > >
> > > If the developer wants to construct a Table or insert a new "column",
> > > Column objects must generally be constructed, leading to boilerplate
> > > without clear benefit.
> > >
> > > Since we're discussing building a more significant higher-level
> > > DataFrame interface per past mailing list discussions, my preference
> > > would be to consider removing the Column class to make the user- and
> > > developer-facing data structures simpler. I hate to propose breaking
> > > API changes, so it may not be practical at this point, but I wanted to
> > > at least bring up the issue to see if others have opinions after
> > > working with the library for a few years.
> > >
> > > Thanks
> > > Wes
> > >
> >
>


Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

2019-07-09 Thread Uwe L. Korn
Hello Wes,

where do you intend the Field object to live then? Would this be part of the
schema of the Table object?

Uwe

On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> hi folks,
> 
> For some time now I have been uncertain about the utility provided by
> the arrow::Column C++ class. Fundamentally, it is a container for two
> things:
> 
> * An arrow::Field object (name and data type)
> * An arrow::ChunkedArray object for the data
> 
> It was added to the C++ library in ARROW-23 in March 2016 as the basis
> for the arrow::Table class which represents a collection of
> ChunkedArray objects coming usually from multiple RecordBatches.
> Sometimes a Table will have mostly columns with a single chunk while
> some columns will have many chunks.
> 
> I'm concerned about continuing to maintain the Column class as it's
> spilling complexity into computational libraries and bindings alike.
> 
> The Python Column class for example mostly forwards method calls to
> the underlying ChunkedArray
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> 
> If the developer wants to construct a Table or insert a new "column",
> Column objects must generally be constructed, leading to boilerplate
> without clear benefit.
> 
> Since we're discussing building a more significant higher-level
> DataFrame interface per past mailing list discussions, my preference
> would be to consider removing the Column class to make the user- and
> developer-facing data structures simpler. I hate to propose breaking
> API changes, so it may not be practical at this point, but I wanted to
> at least bring up the issue to see if others have opinions after
> working with the library for a few years.
> 
> Thanks
> Wes
>


Re: [DISCUSS] C++ SO versioning with 1.0.0

2019-07-03 Thread Uwe L. Korn
I've documented that some time ago: 
https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst

I actually wanted to add this to the build, but we were breaking the ABI so
often that it would never have been green.

Uwe
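
Purely as an illustration of the first scheme Kou describes below (not actual
build code), the version-to-SO-version mapping would be:

// Illustration of the proposed mapping, not build system code:
// SO major = major * 100 + minor, so the 0.x SO versions (10-14) are never reused.
constexpr int SoMajorVersion(int major, int minor) { return major * 100 + minor; }

static_assert(SoMajorVersion(1, 0) == 100, "1.0.z -> libarrow.so.100");
static_assert(SoMajorVersion(1, 1) == 101, "1.1.z -> libarrow.so.101");
static_assert(SoMajorVersion(1, 2) == 102, "1.2.z -> libarrow.so.102");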

On Wed, Jul 3, 2019, at 9:52 PM, Sutou Kouhei wrote:
> Ruby uses ABI Compliance Checker
> https://lvc.github.io/abi-compliance-checker/
> with a small script:
> 
>   https://github.com/ruby/chkbuild/blob/master/abi-checker.rb
> 
> There is the official Debian package for it:
> 
>   https://packages.debian.org/search?keywords=abi-compliance-checker
> 
> In <20c3b917-6f80-ca14-669d-f89e7ec7f...@python.org>
>   "Re: [DISCUSS] C++ SO versioning with 1.0.0" on Wed, 3 Jul 2019 
> 09:59:15 +0200,
>   Antoine Pitrou  wrote:
> 
> > 
> > Do we have any reliable tool to check for ABI breakage?
> > 
> > 
> > On 03/07/2019 at 02:57, Sutou Kouhei wrote:
> >> Hi,
> >> 
> >> We'll release 0.14.0 soon. Then we use "1.0.0-SNAPSHOT" at
> >> master. If we use "1.0.0-SNAPSHOT", C++ build is failed:
> >> 
> >> https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L47
> >> 
> >>   message(FATAL_ERROR "Need to implement SO version generation for Arrow 
> >> 1.0+")
> >> 
> >> So we need to consider how to generate SO version for 1.0.0
> >> as the first task for 1.0.0.
> >> 
> >> See also https://issues.apache.org/jira/browse/ARROW-2522
> >> for the current SO versioning.
> >> 
> >> 
> >> If we may break ABI compatibility each minor version up
> >> release ("Y" is increased in "X.Y.Z"), we should include
> >> minor version into SO major version (100, 101 and 102 in the
> >> following examples):
> >> 
> >>   * 1.0.0 -> libarrow.100.0.0
> >>   * 1.1.0 -> libarrow.101.0.0
> >>   * 1.2.0 -> libarrow.102.0.0
> >> 
> >> If we don't break ABI compatibility with each minor version
> >> release, we just keep the same SO major version (100 in the
> >> following examples) across the 1.x releases:
> >> 
> >>   * 1.0.0 -> libarrow.100.0.0
> >>   * 1.1.0 -> libarrow.100.1.0
> >>   * 1.2.0 -> libarrow.100.2.0
> >> 
> >> 
> >> I chose 1XX as the SO major version because we already use
> >> 10-14 as SO major versions. We should not reuse those in the
> >> future, to avoid confusion. That is why I chose 1XX in the above
> >> examples.
> >> 
> >> 
> >> Any thoughts?
> >> 
> >> 
> >> Thanks,
> >> --
> >> kou
> >> 
>
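
(Editor's sketch, not code from the thread: the two schemes map a release version "X.Y.Z" to an SO version roughly like this, assuming the 1XX convention described above.)

```
def so_version_if_abi_breaks_each_minor(major, minor):
    # Scheme 1: fold the minor version into the SO major version,
    # e.g. 1.2.0 -> libarrow.so.102.0.0
    return f"{major * 100 + minor}.0.0"

def so_version_if_abi_stable_within_major(major, minor):
    # Scheme 2: keep one SO major version per release major version,
    # e.g. 1.2.0 -> libarrow.so.100.2.0
    return f"{major * 100}.{minor}.0"

assert so_version_if_abi_breaks_each_minor(1, 2) == "102.0.0"
assert so_version_if_abi_stable_within_major(1, 2) == "100.2.0"
```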


Re: New CI system: Ursabot

2019-06-16 Thread Uwe L. Korn



On Fri, Jun 14, 2019, at 11:23 PM, Krisztián Szűcs wrote:
> On Fri, Jun 14, 2019 at 9:04 PM Wes McKinney  wrote:
> 
> > hi Krisz,
> >
> > Thanks for working on this! It already helped me fix a Python 2.7-only
> > bug yesterday https://github.com/apache/arrow/pull/4553
> >
> > I have a bunch of questions:
> >
> > * What is the license of the ursabot codebase? Seems like it could be
> > GPL if Buildbot itself is [2] and you have reused any Buildbot code.
> > This is not mentioned in the README so I think you need to call this
> > out and advise people to be careful that any code contributed to this
> > codebase is stuck in GPL-land. Is it possible to separate the work
> > you've done from any GPL-ness? I don't think you can be too paranoid
> > about this kind of thing, and the longer you wait to draw a clear line
> > around any contaminated code the harder it will be to disentangle.
> >
> Created an issue for it https://github.com/ursa-labs/ursabot/issues/105
> 
> >
> > * How brittle is the Buildbot master? It currently resides in my home,
> > but what if a natural disaster (like [1] from 2010) occurs in
> > Nashville causing an extended power outage (or a temporary outage
> > requiring human intervention while I'm out of town)? Is the Buildbot
> > master state backed up? Can it be easily migrated to a new host
> > (Dockerized, even)? Either way we need a contingency plan.
> >
> It doesn't have a backup yet, although it only matters for historical
> reasons. If I prune its database and restart the buildmaster, nothing changes
> except losing the previous builds' results - so everything will work the same
> way as before erasing the database.
> It can be migrated to another host pretty easily; besides setting up the
> networking with the workers and a database server (or SQLite), nothing more
> esoteric is required.
> 
> >
> > * The availability of Buildbot suggests we should try to decouple our
> > CI procedures from anything Travis CI specific and use Docker instead,
> > at least for the Linux builds. This has the side benefit of enabling
> > contributors to reproduce CI builds locally. Can you create some JIRAs
> > about this?
> >
> Yes, IMO Docker is preferable, even for Windows containers [1]. Will do, but
> I'm curious about others' opinions too. @Uwe?

Yes, also for Windows builds Docker is really nice for getting a clean, 
reproducible state. Only for OSX builds will you need to boot VMs these days 
when you want clean separation.

Uwe


Re: Reduced Arrow CI capacity

2019-05-31 Thread Uwe L. Korn



On Fri, May 31, 2019, at 12:11 AM, Antoine Pitrou wrote:
> 
> Le 30/05/2019 à 22:39, Uwe L. Korn a écrit :
> > Hello all,
> > 
> > Krisztián has lately been working on getting Buildbot running for Arrow. 
> > While I have not yet had the time to look at it in detail, what would hinder 
> > us from using it as the main Linux builder and ditching Travis except for OSX?
> > 
> > Otherwise I have lately had really good experiences with GitLab CI 
> > connected to GitHub projects. While they only offer a comparatively small 
> > amount of CI time per month per project (2000 minutes is quite small in the 
> > Arrow case), I liked that you can connect your own builders to their 
> > hosted gitlab.com instance. This would enable us to easily add funded 
> > workers to the project as well as utilise special hardware that we would 
> > not otherwise get on public CI instances. The CI runners ("workers") are 
> > really simple to set up (it took me less than 5 min each on Windows and on 
> > Linux) and the logs show up in the hosted UI.
> 
> Are there any security issues with running self-hosted workers?
> Another question is whether Gitlab CI is allowed on Github repos owned
> by the Apache Foundation (Azure Pipelines still isn't).


The security implications are the same as with any self-hosted, Docker-based CI: 
there is a certain chance that people can escape the Docker sandbox and do nasty 
things on the host. Thus we shouldn't store any additional credentials on the 
host except what is needed to connect to the GitLab master.

I'm not sure about the requirements from GitLab for the integration. They 
provide a hook for the CI status and a full-blown sync integration. The latter 
really wants all-access, which ASF INFRA won't grant; for the former we may 
not even need INFRA, but I have to look deeper into that.

Uwe


Re: Reduced Arrow CI capacity

2019-05-30 Thread Uwe L. Korn
Hello all,

Krisztián has lately been working on getting Buildbot running for Arrow. While 
I have not yet had the time to look at it in detail, what would hinder us from 
using it as the main Linux builder and ditching Travis except for OSX?

Otherwise I have lately had really good experiences with GitLab CI connected 
to GitHub projects. While they only offer a comparatively small amount of CI 
time per month per project (2000 minutes is quite small in the Arrow case), I 
liked that you can connect your own builders to their hosted gitlab.com 
instance. This would enable us to easily add funded workers to the project as 
well as utilise special hardware that we would not otherwise get on public CI 
instances. The CI runners ("workers") are really simple to set up (it took me 
less than 5 min each on Windows and on Linux) and the logs show up in the hosted UI.

Uwe

On Thu, May 30, 2019, at 8:28 PM, Wes McKinney wrote:
> hi folks,
> 
> Over the last several months we have probably been overutilizing the
> ASF's Travis CI capacity; it seems that we are being limited as of
> very recently to 5 concurrent build workers so CI feedback is taking
> longer.
> 
> We have an INFRA ticket about the issue here
> 
> https://issues.apache.org/jira/browse/INFRA-18533
> 
> If anyone has funding available to help pay for additional CI capacity
> for Apache Arrow please let me know online or offline. We will assess
> our options based on what the Infrastructure team says
> 
> thanks,
> Wes
>


Re: Not testing Python 2.7 on CI

2019-05-30 Thread Uwe L. Korn
Hello Antoine,

if we're not testing Python 2.7 on CI anymore, I would suggest dropping Python 
2 support completely. My personal experience tells me that once we drop 
Python 2 on CI, we will immediately build some simple thing that breaks Python 2 
support. 

Pushing out releases that might work for Python 2 but in the end don't will 
make Arrow users more frustrated than just dropping support for it. In 
addition, we would also have to deal with a lot of user reports when we break 
it; explicitly dropping Py2 would just mean that a "python2 -m pip install 
pyarrow" leaves them with an old but functioning version.

Regards
Uwe

On Thu, May 30, 2019, at 4:15 PM, Antoine Pitrou wrote:
> 
> Hello,
> 
> Python 2.7 will soon be end-of-life.  It will stop being supported
> upstream on January 1st, 2020.  Many projects have started publishing
> Python 3-only releases (see https://python3statement.org/).
> 
> PyArrow will soon stop supporting Python 2 as well, perhaps at the end
> of the year.
> 
> In the meantime, as build times on public Continuous Integration
> services have continuously ballooned, we could start by not testing
> Python 2 anymore on those services.  It would ease build times a bit.
> Testing would be done as a best effort thing, possibly by users of
> development versions.
> 
> Regards
> 
> Antoine.
> 
> 
>


Re: Python development setup and LLVM 7 / Gandiva

2019-05-26 Thread Uwe L. Korn
Hello John,

I guess you also have some other llvm-* packages installed on OSX. We currently 
have the problem that they override each other on OSX: 
https://github.com/conda-forge/llvmdev-feedstock/issues/60 The compilers 
shipped by conda-forge on OSX use llvm=4.0.1 and thus this is also installed at 
the same time.

Uwe

On Thu, May 23, 2019, at 9:31 PM, John Muehlhausen wrote:
> Not sure why cmake isn't happy (as in original post).  Environment is set
> up as per instructions:
> 
> (pyarrow-dev) JGM-KTG-Mac-Mini:python jmuehlhausen$ conda list llvmdev
> # packages in environment at
> /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev:
> #
> # Name       Version       Build            Channel
> llvmdev      7.0.0         h04f5b5a_1000    conda-forge
> 
> On Thu, May 23, 2019 at 1:46 PM Wes McKinney  wrote:
> 
> > llvmdev=7 is in the conda_env_cpp.yml requirements file, are you using
> > something else?
> >
> > https://github.com/apache/arrow/blob/master/ci/conda_env_cpp.yml#L31
> >
> > On Thu, May 23, 2019 at 12:53 PM John Muehlhausen  wrote:
> > >
> > > The pyarrow-dev conda environment does not include llvm 7, which appears
> > to
> > > be a requirement for Gandiva.
> > >
> > > So I'm just trying to figure out a pain-free way to add llvm 7 in a way
> > > that cmake can find it, for Mac.
> > >
> > > I had already solved the other Mac problem with
> > > export CONDA_BUILD_SYSROOT=/Users/jmuehlhausen/sdks/MacOSX10.9.sdk
> > >
> > > On Wed, May 22, 2019 at 1:46 PM Wes McKinney 
> > wrote:
> > >
> > > > hi John,
> > > >
> > > > Some changes were just made to address the issue you are having, see
> > > > the latest instructions at
> > > >
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst
> > > >
> > > > Let us know if that does not work.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, May 22, 2019 at 11:02 AM John Muehlhausen  wrote:
> > > > >
> > > > > Set up pyarrow-dev conda environment as at
> > > > > https://arrow.apache.org/docs/developers/python.html
> > > > >
> > > > > Got the following error.  I will disable Gandiva for now but I'd
> > like to
> > > > > get it back at some point.  I'm on Mac OS 10.13.6.
> > > > >
> > > > > CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
> > > > >   Could not find a configuration file for package "LLVM" that is
> > > > compatible
> > > > >   with requested version "7.0".
> > > > >
> > > > >   The following configuration files were considered but not accepted:
> > > > >
> > > > >
> > > > >
> > > >
> > /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/LLVMConfig.cmake,
> > > > > version: 4.0.1
> > > > >
> > > > >
> > > >
> > /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/llvm-config.cmake,
> > > > > version: unknown
> > > > >
> > > > > Call Stack (most recent call first):
> > > > >   src/gandiva/CMakeLists.txt:31 (find_package)
> > > >
> >
>


Re: [Python] Any reason to exclude __lt__ from ArrayValue ?

2019-05-26 Thread Uwe L. Korn
Hello John,

as with most things concerning the *Value classes: missing implementations are 
simply "not done yet" and not explicit omissions. The value instances have not 
yet seen that much use and therefore lack a lot of functionality. Feel free to 
add this to them.

Uwe

On Sat, May 25, 2019, at 6:01 AM, John Muehlhausen wrote:
> We have __eq__ leaning on as_py() already ... any reason not to have __lt__
> ?
> 
> This makes it possible to use bisect to find slices in ordered data without
> a __getitem__ wrapper:
> 
> 1176.0  key=pa.array(['AAPL'])
>  110.0  print(bisect.bisect_left(batch[3],key[0]))
>   64.0  print(bisect.bisect_right(batch[3],key[0]))
> 
> Although, I'm not sure why pa.array() is relatively slow (timings above in
> microseconds) and whether I can directly construct an individual Value
> instead?  batch[3] is of string type with length 32291182 (memory-mapped from
> an IPC file)... AAPL being slice 206424 to 420255 in this case.
> 
> Proposed addition:
> 
> def __eq__(self, other):
>     if hasattr(self, 'as_py'):
>         if isinstance(other, ArrayValue):
>             other = other.as_py()
>         return self.as_py() == other
>     else:
>         raise NotImplementedError(
>             "Cannot compare Arrow values that don't support as_py()")
> 
> def __lt__(self, other):
>     if hasattr(self, 'as_py'):
>         if isinstance(other, ArrayValue):
>             other = other.as_py()
>         return self.as_py() < other
>     else:
>         raise NotImplementedError(
>             "Cannot compare Arrow values that don't support as_py()")
>


Re: A couple of questions about pyarrow.parquet

2019-05-23 Thread Uwe L. Korn
Hello Ted,

regarding predicate pushdown in Python, have a look at my unfinished PR at 
https://github.com/apache/arrow/pull/2623. This was stopped since we were 
missing native filtering in Arrow. The requirements for that have now been 
implemented, and we could probably reactivate the PR.

Uwe
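
(Editor's note: a minimal sketch of how row-group-level predicate pushdown later became available through pyarrow's Parquet reader. The file name and column names are made up for illustration; this is not the API from the PR above.)

```
import pyarrow.parquet as pq

# Row groups whose statistics cannot match the predicate are skipped;
# record-level filtering arrived later via the datasets API.
table = pq.read_table(
    "trades.parquet",                     # hypothetical file
    columns=["symbol", "price"],          # hypothetical columns
    filters=[("symbol", "==", "AAPL")],
)
```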

On Sat, May 18, 2019, at 3:53 AM, Ted Gooch wrote:
> Thanks Micah and Wes.
> 
> Definitely interested in the *Predicate Pushdown* and *Schema inference,
> schema-on-read, and schema normalization *sections.
> 
> On Fri, May 17, 2019 at 12:47 PM Wes McKinney  wrote:
> 
> > Please see also
> >
> >
> > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
> >
> > And prior mailing list discussion. I will comment in more detail on the
> > other items later
> >
> > On Fri, May 17, 2019, 2:44 PM Micah Kornfield 
> > wrote:
> >
> > > I can't help on the first question.
> > >
> > > Regarding push-down predicates, there is an open JIRA [1] to do just that
> > >
> > > [1] https://issues.apache.org/jira/browse/PARQUET-473
> > > <
> > >
> > https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22
> > > >
> > >
> > > On Fri, May 17, 2019 at 11:48 AM Ted Gooch  wrote:
> > >
> > > > Hi,
> > > >
> > > > I've been doing some work trying to get the parquet read path going for
> > > the
> > > > python iceberg 
> > library.  I
> > > > have two questions that I couldn't get figured out, and was hoping I
> > > could
> > > > get some guidance from the list here.
> > > >
> > > > First, I'd like to create a ParquetSchema->IcebergSchema converter, but
> > > it
> > > > appears that only limited information is available in the ColumnSchema
> > > > passed back to the python client[2]:
> > > >
> > > > 
> > > >   name: key
> > > >   path: m.map.key
> > > >   max_definition_level: 2
> > > >   max_repetition_level: 1
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > > 
> > > >   name: key
> > > >   path: m.map.value.map.key
> > > >   max_definition_level: 4
> > > >   max_repetition_level: 2
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > > 
> > > >   name: value
> > > >   path: m.map.value.map.value
> > > >   max_definition_level: 5
> > > >   max_repetition_level: 2
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > >
> > > >
> > > > where physical_type and logical_type are both strings[1].  The arrow
> > > schema
> > > > I can get from *to_arrow_schema *looks to be more expressive(although
> > may
> > > > be I just don't understand the parquet format well enough):
> > > >
> > > > m: struct > list > > > struct not null>>> not null>>
> > > >   child 0, map: list > > list > > > struct not null>>> not null>
> > > >   child 0, map: struct > > > struct not null>>>
> > > >   child 0, key: string
> > > >   child 1, value: struct > > value:
> > > > string> not null>>
> > > >   child 0, map: list > string>
> > > > not null>
> > > >   child 0, map: struct
> > > >   child 0, key: string
> > > >   child 1, value: string
> > > >
> > > >
> > > > It seems like I can infer the info from the name/path, but is there a
> > > more
> > > > direct way of getting the detailed parquet schema information?
> > > >
> > > > Second question, is there a way to push record level filtering into the
> > > > parquet reader, so that the parquet reader only reads in values that
> > > match
> > > > a given predicate expression? Predicate expressions would be simple
> > > > field-to-literal comparisons(>,>=,==,<=,<, !=, is null, is not null)
> > > > connected with logical operators(AND, OR, NOT).
> > > >
> > > > I've seen that after reading-in I can use the filtering language in
> > > > gandiva[3] to get filtered record-batches, but was looking for
> > somewhere
> > > > lower in the stack if possible.
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > > > [2] Spark/Hive Table DDL for this parquet file looks like:
> > > > CREATE TABLE `iceberg`.`nested_map` (
> > > > m map<string, map<string, string>>)
> > > > [3]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
> > > >
> > >
> >
>


Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-05-09 Thread Uwe L. Korn
+1 to the idea of adding a protocol to let other objects define their way to 
Arrow structures. For pandas.Series I would expect them to return an Arrow 
Column. 

For the Arrow->pandas conversion I have somewhat mixed feelings. In the normal 
Fletcher case I would expect that we don't convert anything, as we can represent 
anything from Arrow with it. For the case where we want to restore the exact 
pandas DataFrame we had before, this becomes a bit more complicated: we would 
either need all third-party libraries to support Arrow via a hook as proposed, 
or we would also have to define some other kind of protocol on the pandas side 
to reconstruct ExtensionArrays from Arrow data.

Uwe
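
(Editor's illustration: a minimal sketch of how a third-party array type could implement the proposed hook. The method name and signature follow the proposal quoted below; the MyArray class itself is hypothetical.)

```
import pyarrow as pa

class MyArray:
    """Hypothetical third-party array type."""
    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type=None):
        # Convert to a pyarrow.Array, honouring an explicitly requested type.
        return pa.array(self._values, type=type)

# Once pyarrow checks for the hook, this works without pyarrow knowing
# anything about MyArray:
arr = pa.array(MyArray([1, 2, None]))
print(arr.type)  # int64
```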

> Am 09.05.2019 um 18:20 schrieb Antoine Pitrou :
> 
> 
> Arrow arrays don't have metadata, so if you want to pass metadata around
> you should at least add a hook for columns as well.
> 
> Regards
> 
> Antoine.
> 
> 
>> Le 09/05/2019 à 18:10, Joris Van den Bossche a écrit :
>> An additional question might be at which "level" to provide such a hook to
>> third-party packages: I proposed for Array, but what for chunked arrays,
>> columns or tables? Maybe at least returning a chunked array should also be
>> allowed.
>> 
>> Op do 9 mei 2019 om 18:06 schreef Joris Van den Bossche <
>> jorisvandenboss...@gmail.com>:
>> 
>>> The signature I had in mind is something like:
>>> 
>>> def __arrow_array__(self, type : pyarrow.DataType=None) -> pyarrow.Array:
>>> 
>>> where the function returns a pyarrow.Array, and takes an optional data
>>> type (in case there are multiple ways to convert to a pyarrow Array, and
>>> what can be passed by the user in the type argument in pyarrow.array(..) or
>>> in a specified schema).
>>> 
>>> But, the above is only for a one way path of custom array to Arrow array,
>>> and not enough for a full roundtrip.
>>> 
>>> For a full roundtrip in case of a pandas DataFrame, we will still need to
>>> save information in metadata independently from __arrow_array__ and have
>>> custom code in pyarrow to deal with pandas DataFrames (of which there is
>>> already a lot). I mentioned this briefly in
>>> https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
>>> / https://issues.apache.org/jira/browse/ARROW-2428, but one option could
>>> be to save the name of the pandas extension dtype in the pandas_metadata of
>>> an arrow Table (just as already happens for currently supported types), and
>>> when exporting back to pandas with to_pandas pyarrow could check if this
>>> extension dtype name is registered with pandas and if so, call a method
>>> there to construct it.
>>> 
>>> Joris
>>> 
>>> Op do 9 mei 2019 om 17:38 schreef Antoine Pitrou :
>>> 
 
 Hi Joris,
 
 Do you have a signature for __arrow_array__ method in mind?
 
 For example, let's say you want to roundtrip ExtensionArrays or other
 third-party data through Arrow.  How do you preserve the required
 metadata?
 
 Regards
 
 Antoine.
 
 
> Le 09/05/2019 à 13:29, Joris Van den Bossche a écrit :
> Hi all,
> 
> I want to propose an interface to allow custom array objects in Python
 to
> define how they should be converted to Arrow arrays (e.g. in
> pyarrow.array(..)). I opened
> https://issues.apache.org/jira/browse/ARROW-5271 for this.
> This would be similar to the numpy __array__ protocol (so we could eg
 call
> it __arrow_array__).
> Feedback / discussion very welcome!
> 
> I am coming to this discussion specifically from the point of view of
> pandas ExtensionArrays (github issue for this:
> 
 https://github.com/pandas-dev/pandas/issues/20612/#issuecomment-489649556
 ).
> Such a protocol would, for example, make it possible that pandas users
 can
> save DataFrames with ExtensionArrays (eg the nullable integers) to
 parquet,
> without the need for pyarrow to know about all those possible different
> extension arrays. This would also be useful for projects extending
 pandas
> such as GeoPandas  and Fletcher
> .
> But I suppose it could also be of interest more in general of other
> array-like / pandas-like projects that want to interface with arrow.
> 
> Sidenote: for the pandas case, I want to look a the full roundtrip, so
 also
> the conversion back from an arrow Table to DataFrame. For that aspect
 there
> is https://issues.apache.org/jira/browse/ARROW-2428, but this is much
 more
> specific to pandas and its ExtensionArrays.
> 
> Regards,
> Joris
> 
 
>>> 
>> 



[jira] [Created] (ARROW-5265) [Python/CI] Add integration test with kartothek

2019-05-06 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5265:
--

 Summary: [Python/CI] Add integration test with kartothek
 Key: ARROW-5265
 URL: https://issues.apache.org/jira/browse/ARROW-5265
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Python
Reporter: Uwe L. Korn
 Fix For: 0.15.0


https://github.com/JDASoftwareGroup/kartothek is a heavy user of Apache Arrow 
and thus a good indicator of whether we have introduced breakages in 
{{pyarrow}}. We should therefore run regular integration tests against it, as 
we do with other libraries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: C++ and Python size problems with Arrow 0.13.0

2019-04-07 Thread Uwe L. Korn


> By the way, I don't understand why those are not symlinks.

They should be symlinks; we have special code for this: 
https://github.com/apache/arrow/blob/4495305092411e8551c60341e273c8aa3c14b282/python/setup.py#L489-L499
This probably doesn't make it into the wheel, as wheels are zip files and those 
don't support symlinks by default. So we probably need to pass the `--symlinks` 
parameter to the wheel code.

Uwe
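
(Editor's illustration of the zip/symlink issue described above; a small self-contained sketch, not part of the original thread.)

```
import os
import tempfile
import zipfile

# zipfile (which bdist_wheel builds on) follows symlinks by default and
# stores a full copy of the target instead of the link itself (POSIX only).
d = tempfile.mkdtemp()
target = os.path.join(d, "libarrow.so.14")
link = os.path.join(d, "libarrow.so")
with open(target, "wb") as f:
    f.write(b"\x7fELF" + b"\x00" * 1000)   # stand-in for a shared library
os.symlink("libarrow.so.14", link)

zpath = os.path.join(d, "wheel_like.zip")
with zipfile.ZipFile(zpath, "w") as z:
    z.write(target, "libarrow.so.14")
    z.write(link, "libarrow.so")           # stored as a full copy, not a link

with zipfile.ZipFile(zpath) as z:
    for info in z.infolist():
        print(info.filename, info.file_size)   # both entries are full-size
```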


Re: C++ and Python size problems with Arrow 0.13.0

2019-04-07 Thread Uwe L. Korn
The only magic that auditwheel does on the Linux package is that it pulls 
our shared version of libz.so into the wheel; otherwise there should be no 
differences in the wheel contents.

Uwe

On Wed, Apr 3, 2019, at 12:06 PM, Krisztián Szűcs wrote:
> This is what the wheel contains before running auditwheel:
> 
> -rwxr-xr-x  1 root root 128K Apr  3 09:02 libarrow_boost_filesystem.so
> -rwxr-xr-x  1 root root 128K Apr  3 09:02
> libarrow_boost_filesystem.so.1.66.0
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so.1.66.0
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so.1.66.0
> -rwxr-xr-x  1 root root 1.4M Apr  3 09:02 libarrow_python.so
> -rwxr-xr-x  1 root root 1.4M Apr  3 09:02 libarrow_python.so.14
> -rwxr-xr-x  1 root root  12M Apr  3 09:02 libarrow.so
> -rwxr-xr-x  1 root root  12M Apr  3 09:02 libarrow.so.14
> -rw-r--r--  1 root root 6.1M Apr  3 09:02 lib.cpp
> -rwxr-xr-x  1 root root 2.4M Apr  3 09:02
> lib.cpython-36m-x86_64-linux-gnu.so
> -rwxr-xr-x  1 root root  55M Apr  3 09:02 libgandiva.so
> -rwxr-xr-x  1 root root  55M Apr  3 09:02 libgandiva.so.14
> -rwxr-xr-x  1 root root 2.9M Apr  3 09:02 libparquet.so
> -rwxr-xr-x  1 root root 2.9M Apr  3 09:02 libparquet.so.14
> -rwxr-xr-x  1 root root 309K Apr  3 09:02 libplasma.so
> -rwxr-xr-x  1 root root 309K Apr  3 09:02 libplasma.so.14
> 
> After running auditwheel, the repaired wheel contains:
> 
> -rwxr-xr-x  1 root root 128K Apr  3 09:02 libarrow_boost_filesystem.so
> -rwxr-xr-x  1 root root 128K Apr  3 09:02
> libarrow_boost_filesystem.so.1.66.0
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so
> -rwxr-xr-x  1 root root 1.2M Apr  3 09:02 libarrow_boost_regex.so.1.66.0
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so
> -rwxr-xr-x  1 root root  30K Apr  3 09:02 libarrow_boost_system.so.1.66.0
> -rwxr-xr-x  1 root root 1.6M Apr  3 09:55 libarrow_python.so
> -rwxr-xr-x  1 root root 1.4M Apr  3 09:02 libarrow_python.so.14
> -rwxr-xr-x  1 root root  12M Apr  3 09:55 libarrow.so
> -rwxr-xr-x  1 root root  12M Apr  3 09:02 libarrow.so.14
> -rw-r--r--  1 root root 6.1M Apr  3 09:02 lib.cpp
> -rwxr-xr-x  1 root root 2.5M Apr  3 09:55
> lib.cpython-36m-x86_64-linux-gnu.so
> -rwxr-xr-x  1 root root  59M Apr  3 09:55 libgandiva.so
> -rwxr-xr-x  1 root root  55M Apr  3 09:02 libgandiva.so.14
> -rwxr-xr-x  1 root root 3.5M Apr  3 09:55 libparquet.so
> -rwxr-xr-x  1 root root 2.9M Apr  3 09:02 libparquet.so.14
> -rwxr-xr-x  1 root root 345K Apr  3 09:55 libplasma.so
> -rwxr-xr-x  1 root root 309K Apr  3 09:02 libplasma.so.14
> 
> Here is the output of auditwheel
> https://travis-ci.org/kszucs/crossbow/builds/514605723#L3340
> 
> On Wed, Apr 3, 2019 at 10:36 AM Antoine Pitrou  wrote:
> 
> >
> > Le 03/04/2019 à 02:23, Wes McKinney a écrit :
> > >
> > > $ ll Library/lib/
> > > total 741796
> > > -rw-r--r-- 1 wesm wesm   1507048 Mar 27 23:34 arrow.lib
> > > -rw-r--r-- 1 wesm wesm 76184 Mar 27 23:35 arrow_python.lib
> > > -rw-r--r-- 1 wesm wesm  61322082 Mar 27 23:36 arrow_python_static.lib
> > > -rw-r--r-- 1 wesm wesm 328090044 Mar 27 23:37 arrow_static.lib
> > > drwxr-xr-x 3 wesm wesm  4096 Apr  2 19:12 cmake/
> > > -rw-r--r-- 1 wesm wesm    302496 Mar 27 23:38 gandiva.lib
> > > -rw-r--r-- 1 wesm wesm 239314018 Mar 27 23:40 gandiva_static.lib
> > > -rw-r--r-- 1 wesm wesm    491292 Mar 27 23:41 parquet.lib
> > > -rw-r--r-- 1 wesm wesm 128473780 Mar 27 23:42 parquet_static.lib
> > > drwxr-xr-x 2 wesm wesm  4096 Apr  2 19:12 pkgconfig/
> > >
> > > As a mitigating measure in the meantime, I would suggest that we stop
> > > bundling the static libraries in the arrow-cpp conda package, since
> > > we're just hurting release managers and users with a large package
> > > download when they `conda install pyarrow`.
> >
> > Agreed.
> >
> > > Can someone open a JIRA
> > > issue about this?
> >
> > See https://issues.apache.org/jira/browse/ARROW-5101
> >
> > > There's something very odd here, though, which is that libgandiva.so
> > > and libgandiva.so.13 appear to be distinct.
> >
> > Not only.  libparquet.so, libplasma.so and libarrow.so are distinct as
> > well.  This means that we may be building those libraries twice instead
> > of copying the files.
> >
> > By the way, I don't understand why those are not symlinks.
> >
> Me neither, but I guess setup.py bdist_wheel doesn't support symlinks.
> 
> >
> > > That seems buggy to me. We might also investigate if there's a way to
> > > trim the binary sizes in some way.
> >
> > Well, there's always "strip -s", but it doesn't seem to remove much
> > (libgandiva.so shrinks from 60 to 50 MB, and you lose all debug
> > information).
> >
> > One issue seems to be that libgandiva.so links LLVM statically, but
> > doesn't hide LLVM symbols.  That said, libllvmlite.so (which hides LLVM
> > symbols) has grown 

Re: [VOTE] Proposed changes to Arrow Flight protocol

2019-04-07 Thread Uwe L. Korn
+1 (binding)

On Sat, Apr 6, 2019, at 3:09 AM, Kouhei Sutou wrote:
> +1 (binding)
> 
> In 
>   "[VOTE] Proposed changes to Arrow Flight protocol" on Tue, 2 Apr 2019 
> 19:05:27 -0500,
>   Wes McKinney  wrote:
> 
> > Hi,
> > 
> > David Li has proposed to make the following additions or changes
> > to the Flight gRPC service definition [1] and general design, as explained 
> > in
> > greater detail in the linked Google Docs document [2]. Arrow
> > Flight is an in-development messaging framework for creating
> > services that can, among other things, send and receive the Arrow
> > binary protocol without intermediate serialization.
> > 
> > The changes proposed are as follows:
> > 
> > Proposal 1: In FlightData, add a bytes field for application-defined 
> > metadata.
> > In DoPut, change the return type to be streaming, and add a bytes
> > field to PutResult for application-defined metadata.
> > 
> > Proposal 2: In client/server APIs, add a call options parameter to
> > control timeouts and provide access to the identity of the
> > authenticated peer (if any).
> > 
> > Proposal 3: Add an interface to define authentication protocols on the
> > client and server, using the existing Handshake endpoint and adding a
> > protocol-defined, per-call token.
> > 
> > Proposal 4: Construct the client/server using builders to allow
> > configuration of transport-specific options and open the door for
> > alternative transports.
> > 
> > The actual changes will be made through subsequent pull requests
> > that change Flight.proto and the existing Flight implementations
> > in C++ and Java.
> > 
> > Please vote whether to accept the changes. The vote will be open
> > for at least 72 hours.
> > 
> > [ ] +1 Accept these changes to the Flight protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> > 
> > Thanks,
> > Wes
> > 
> > [1]: https://github.com/apache/arrow/blob/master/format/Flight.proto
> > [2]: 
> > https://docs.google.com/document/d/1aIVZ8SD5dMZXHTCeEY9PoNAwyuUgG-UEjmd3zfs1PYM/edit
>


Re: [DRAFT] Apache Arrow ASF Board Report April 2019

2019-04-07 Thread Uwe L. Korn
+1

On Fri, Apr 5, 2019, at 10:02 PM, Wes McKinney wrote:
> ## Description:
> 
> Apache Arrow is a cross-language development platform for in-memory data. It
> specifies a standardized language-independent columnar memory format for flat
> and hierarchical data, organized for efficient analytic operations on modern
> hardware. It also provides computational libraries and zero-copy streaming
> messaging and interprocess communication. Languages currently supported
> include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
> 
> ## Issues:
> - There are no issues requiring board attention at this time
> 
> ## Activity:
>  - The project received a donation of DataFusion, a Rust-based query
> engine for Apache Arrow
> 
> ## Health report:
> - The project is very healthy, with a growing number and diversity of
>contributors
> 
> ## PMC changes:
> 
>  - Currently 26 PMC members.
>  - Andrew Grove was added to the PMC on Sun Feb 03 2019
> 
> ## Committer base changes:
> 
>  - Currently 41 committers.
>  - New committers:
> - Micah Kornfield was added as a committer on Fri Mar 08 2019
> - Deepak Majeti was added as a committer on Thu Jan 31 2019
> - Paddy Horan was added as a committer on Fri Feb 08 2019
> - Ravindra Pindikura was added as a committer on Fri Feb 01 2019
> - Sun Chao was added as a committer on Fri Feb 22 2019
> 
> ## Releases:
> 
>  - 0.12.0 was released on Sat Jan 26 2019
>  - 0.12.1 was released on Sun Feb 24 2019
>  - 0.13.0 was released on Sun Mar 31 2019
>  - JS-0.4.0 was released on Tue Feb 05 2019
>  - JS-0.4.1 was released on Sat Mar 23 2019
> 
> ## JIRA activity:
> 
>  - 969 JIRA tickets created in the last 3 months
>  - 861 JIRA tickets closed/resolved in the last 3 months
>


Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-07 Thread Uwe L. Korn
+1 (binding)

On Sat, Apr 6, 2019, at 2:44 AM, Kouhei Sutou wrote:
> +1 (binding)
> 
> In 
>   "[VOTE] Add new DurationInterval Type to Arrow Format" on Wed, 3 Apr 
> 2019 07:59:56 -0700,
>   Jacques Nadeau  wrote:
> 
> > I'd like to propose a change to the Arrow format to support a new duration
> > type. Details below; see the mailing list threads for the discussion.
> > 
> > 
> > // An absolute length of time unrelated to any calendar artifacts.  For the
> > purposes
> > /// of Arrow Implementations, adding this value to a Timestamp ("t1")
> > naively (i.e. simply summing
> > /// the two numbers) is acceptable even though in some cases the resulting
> > Timestamp (t2) would
> > /// not account for leap-seconds during the elapsed time between "t1" and
> > "t2".  Similarly, representing
> > /// the difference between two Unix timestamp is acceptable, but would
> > yield a value that is possibly a few seconds
> > /// off from the true elapsed time.
> > ///
> > ///  The resolution defaults to
> > /// millisecond, but can be any of the other supported TimeUnit values as
> > /// with Timestamp and Time types.  This type is always represented as
> > /// an 8-byte integer.
> > table DurationInterval {
> >unit: TimeUnit = MILLISECOND;
> > }
> > 
> > 
> > Please vote whether to accept the changes. The vote will be open
> > for at least 72 hours.
> > 
> > [ ] +1 Accept this addition to the Arrow format
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
>
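
(Editor's illustration of the "naive addition" semantics described in the proposal, using the duration type as it later appeared in pyarrow; the concrete values are made up.)

```
import pyarrow as pa

t1 = pa.array([1_554_300_000_000], type=pa.timestamp("ms"))
dt = pa.array([90_000], type=pa.duration("ms"))   # 90 seconds

# Both types are 8-byte integers underneath, so "naive" addition is just
# summing the underlying values and reinterpreting them as a timestamp.
t1_int = t1.cast(pa.int64())[0].as_py()
dt_int = dt.cast(pa.int64())[0].as_py()
t2 = pa.array([t1_int + dt_int], type=pa.timestamp("ms"))
print(t2[0])
```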


[jira] [Created] (ARROW-5074) [C++/Python] When installing into a SYSTEM prefix, RPATHs are not correctly set

2019-03-31 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-5074:
--

 Summary: [C++/Python] When installing into a SYSTEM prefix, RPATHs 
are not correctly set
 Key: ARROW-5074
 URL: https://issues.apache.org/jira/browse/ARROW-5074
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging, Python
Reporter: Uwe L. Korn


When installing the Arrow libraries into a system prefix (mostly a conda 
env), the RPATHs are not correctly set by CMake (there is no RPATH). Thus we 
need to use {{LD_LIBRARY_PATH}} in consumers. When packages are built using 
{{conda-build}}, it takes care of that in its post-processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4987) [C++] Use orc conda-package on Linux and OSX

2019-03-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4987:
--

 Summary: [C++] Use orc conda-package on Linux and OSX
 Key: ARROW-4987
 URL: https://issues.apache.org/jira/browse/ARROW-4987
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow JS 0.4.1 - RC0

2019-03-21 Thread Uwe L. Korn
This sadly fails locally for me on OSX High Sierra:

```
+ npm run test

> apache-arrow@0.4.1 test 
> /private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1
> NODE_NO_WARNINGS=1 gulp test

[15:23:02] Using gulpfile 
/private/var/folders/3j/b8ctc4654q71hd_nqqh8yxc0gp/T/arrow-js-0.4.1.X.8XkDsa8C/apache-arrow-js-0.4.1/gulpfile.js
[15:23:02] Starting 'test'...
[15:23:02] Starting 'test:ts'...
[15:23:02] Starting 'test:src'...
[15:23:02] Starting 'test:apache-arrow'...

  ● Test suite failed to run

TypeError: Cannot assign to read only property 'Symbol(Symbol.toStringTag)' 
of object '#'

  at exports.default 
(node_modules/jest-util/build/create_process_object.js:15:34)
```

This is the same error as in the nightlies but the fix there doesn't help for 
me locally.

Uwe

On Thu, Mar 21, 2019, at 2:41 AM, Brian Hulette wrote:
> +1 (non-binding)
> 
> Ran js-verify-release-candidate.sh on Archlinux w/ node v11.12.0
> 
> Thanks Krisztian!
> Brian
> 
> On Wed, Mar 20, 2019 at 5:40 PM Paul Taylor  wrote:
> 
> > +1 non-binding
> >
> > Ran `dev/release/js-verify-release-candidate.sh 0.4.1 0` on MacOS High
> > Sierra w/ node v11.6.0
> >
> >
> > On Wed, Mar 20, 2019 at 5:21 PM Kouhei Sutou  wrote:
> >
> > > +1 (binding)
> > >
> > > I ran the followings on Debian GNU/Linux sid:
> > >
> > >   * dev/release/js-verify-release-candidate.sh 0.4.1 0
> > >
> > > with:
> > >
> > >   * Node.js v11.12.0
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In 
> > >   "[VOTE] Release Apache Arrow JS 0.4.1 - RC0" on Thu, 21 Mar 2019
> > > 00:09:54 +0100,
> > >   Krisztián Szűcs  wrote:
> > >
> > > > Hello all,
> > > >
> > > > I would like to propose the following release candidate (rc0) of Apache
> > > > Arrow JavaScript version 0.4.1.
> > > >
> > > > The source release rc0 is hosted at [1].
> > > >
> > > > This release candidate is based on commit
> > > > f55542eeb59dde8ff4512c707b9eca1b43b62073
> > > >
> > > > Please download, verify checksums and signatures, run the unit tests,
> > and
> > > > vote
> > > > on the release. The easiest way is to use the JavaScript-specific
> > release
> > > > verification script dev/release/js-verify-release-candidate.sh.
> > > >
> > > > [ ] +1 Release this as Apache Arrow JavaScript 0.4.1
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Arrow JavaScript 0.4.1 because...
> > > >
> > > >
> > > > How to validate a release signature:
> > > > https://httpd.apache.org/dev/verification.html
> > > >
> > > > [1]:
> > > https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-js-0.4.1-rc0/
> > > > [2]:
> > > >
> > >
> > https://github.com/apache/arrow/tree/f55542eeb59dde8ff4512c707b9eca1b43b62073
> > >
> >
>


Re: MemoryPool in Arrow libraries

2019-03-21 Thread Uwe L. Korn
Hello,

> On alignment: The Arrow Spec calls for at least 8-byte alignment but
> recommends 64-byte alignment precisely for SIMD use-cases.   There is still
> an open JIRA item [3] to make Java have 64-byte alignment, so I don't think
> Java is handling 64-byte alignment (I don't know about 8-byte alignment,
> which might come for free on 64-bit platforms), and I don't believe much
> work has been done on the C++ implementation to explicitly exploit the
> alignment requirement.

We don't yet make much use of this, but things like memcpy should be 
automatically faster on all systems, as libc does dynamic dispatch 
depending on alignment and the available instruction set. I have seen this in 
the traces of my local benchmarks.

Uwe
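
(Editor's illustration: the 64-byte alignment of the default C++ memory pool can be observed from Python; a small sketch using pyarrow's allocate_buffer, not code from the thread.)

```
import pyarrow as pa

# The default C++ memory pool hands out 64-byte-aligned allocations,
# which is what lets memcpy and SIMD kernels take their fast paths.
buf = pa.allocate_buffer(1024)
print(buf.address % 64 == 0)   # True with the default pool
```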


[jira] [Created] (ARROW-4985) [C++] arrow/testing headers are not installed

2019-03-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4985:
--

 Summary: [C++] arrow/testing headers are not installed
 Key: ARROW-4985
 URL: https://issues.apache.org/jira/browse/ARROW-4985
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Timeline for 0.13 Arrow release

2019-03-19 Thread Uwe L. Korn
; >>>>>>>>> https://github.com/apache/arrow/pull/3671 - [Rust] Table API
> >> (a.k.a
> >>>>>>>>> DataFrame)
> >>>>>>>>>
> >>>>>>>>> https://github.com/apache/arrow/pull/3851 - [Rust] Parquet data
> >>>>>> source
> >>>>>>>> in
> >>>>>>>>> DataFusion
> >>>>>>>>>
> >>>>>>>>> Once these are merged I have some small follow up PRs for 0.13.0
> >>>>>> that I
> >>>>>>>> can
> >>>>>>>>> get done this week.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Andy.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 12, 2019 at 8:21 AM Wes McKinney 
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> hi folks,
> >>>>>>>>>>
> >>>>>>>>>> I think we are on track to be able to release toward the end of
> >>>>>> this
> >>>>>>>>>> month. My proposed timeline:
> >>>>>>>>>>
> >>>>>>>>>> * This week (March 11-15): feature/improvement push mostly
> >>>>>>>>>> * Next week (March 18-22): shift to bug fixes, stabilization,
> >> empty
> >>>>>>>>>> backlog of feature/improvement JIRAs
> >>>>>>>>>> * Week of March 25: propose release candidate
> >>>>>>>>>>
> >>>>>>>>>> Does this seem reasonable? This puts us at about 9-10 weeks from
> >>>>>> 0.12.
> >>>>>>>>>>
> >>>>>>>>>> We need an RM for 0.13, any PMCs want to volunteer?
> >>>>>>>>>>
> >>>>>>>>>> Take a look at our release page:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091219
> >>>>>>>>>>
> >>>>>>>>>> Out of the open or in-progress issues, we have:
> >>>>>>>>>>
> >>>>>>>>>> * C#: 3 issues
> >>>>>>>>>> * C++ (all components): 51 issues
> >>>>>>>>>> * Java: 3 issues
> >>>>>>>>>> * Python: 38 issues
> >>>>>>>>>> * Rust (all components): 33 issues
> >>>>>>>>>>
> >>>>>>>>>> Please help curating the backlogs for each component. There's a
> >>>>>>>>>> smattering of issues in other categories. There are also 10 open
> >>>>>>>>>> issues with No Component (and 20 resolved issues), those need
> >> their
> >>>>>>>>>> metadata fixed.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Wes
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Feb 27, 2019 at 1:49 PM Wes McKinney  >>>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> The timeline for the 0.13 release is drawing closer. I would say
> >>>>>> we
> >>>>>>>>>>> should consider a release candidate either the week of March 18
> >>>>>> or
> >>>>>>>>>>> March 25, which gives us ~3 weeks to close out backlog items.
> >>>>>>>>>>>
> >>>>>>>>>>> There are around 220 issues open or in-progress in
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >> https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.13.0+Release
> >>>>>>>>>>>
> >>>>>>>>>>> Please have a look. If issues are not assigned to someone as the
> >>>>>> next
> >>>>>>>>>>> couple of weeks pass by I'll begin moving at least C++ and Python
> >>>>>>>>>>> issues to 0.14 that don't seem like they're going to get done for
> >>>>>>>>>>> 0.13. If development stakeholders for C#, Java, Rust, Ruby, and
> >>>>>> other
> >>>>>>>>>>> components can review and curate the issues that would be
> >>>>>> helpful.
> >>>>>>>>>>>
> >>>>>>>>>>> You can help keep the JIRA issues tidy by making sure to add Fix
> >>>>>>>>>>> Version to issues and to make sure to add a Component so that
> >>>>>> issues
> >>>>>>>>>>> are properly categorized in the release notes.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Wes
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Feb 9, 2019 at 10:39 AM Wes McKinney <
> >>>>>> wesmck...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> See
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> >>>>>>>>>>>>
> >>>>>>>>>>>> The source release step is one of the places where problems
> >>>>>> occur.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sat, Feb 9, 2019, 10:33 AM  >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Feb 8, 2019, at 9:19 AM, Uwe L. Korn 
> >>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We could dockerize some of the release steps to ensure that
> >>>>>> they
> >>>>>>>>>> run in the same environment.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I may be able to help with said Dockerization. If not for this
> >>>>>>>>>> release, then for the next. Are there docs on which systems we
> >>>>>> wish to
> >>>>>>>>>> target and/or any build steps beyond the current dev container (
> >>>>>>>>>> https://github.com/apache/arrow/tree/master/dev/container)?
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>
> > 
>


[jira] [Created] (ARROW-4960) [R] Add crossbow task for r-arrow-feedstock

2019-03-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4960:
--

 Summary: [R] Add crossbow task for r-arrow-feedstock
 Key: ARROW-4960
 URL: https://issues.apache.org/jira/browse/ARROW-4960
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, R
Reporter: Uwe L. Korn
 Fix For: 0.14.0


We also have an R package on conda-forge now: 
[https://github.com/conda-forge/r-arrow-feedstock] This should be tested using 
crossbow as we do with the other packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4948) [JS] Nightly test failing with "Cannot assign to read only property"

2019-03-18 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4948:
--

 Summary: [JS] Nightly test failing with "Cannot assign to read 
only property"
 Key: ARROW-4948
 URL: https://issues.apache.org/jira/browse/ARROW-4948
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Uwe L. Korn
 Fix For: JS-0.5.0


See [https://travis-ci.org/kszucs/crossbow/builds/507807857]

This can be reproduced using {{docker-compose build js && docker-compose run 
js}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: CMake refactor Heads-up

2019-03-18 Thread Uwe L. Korn
I would advise against mixing these environments. Use only packages from 
defaults or only from conda-forge, as things like boost from defaults and boost 
from conda-forge can be installed at the same time but then lead to segfaults.

Uwe

On Mon, Mar 18, 2019, at 2:10 PM, Antoine Pitrou wrote:
> 
> Well, in the meantime I can just use the conda-forge packages.
> (though there are regular issues when updating packages where conda
> switches back and forth from Anaconda and conda-forge packages)
> 
> Regards
> 
> Antoine.
> 
> 
> Le 18/03/2019 à 13:59, Uwe L. Korn a écrit :
> > Hello Antoine,
> > 
> > you're running into 
> > https://github.com/ContinuumIO/anaconda-issues/issues/10731. I would rather 
> > have Anaconda fix this, but we can also add alternative detection for it. 
> > I've opened https://issues.apache.org/jira/browse/ARROW-4946, and I can 
> > look into this in the next hours/tomorrow. As with double-conversion, 
> > `-DFlatbuffers_SOURCE=BUNDLED` is a possible workaround until then.
> > 
> > Uwe
> > 
> > On Mon, Mar 18, 2019, at 1:55 PM, Antoine Pitrou wrote:
> >>
> >> Ah, apparently I can do it through `-Ddouble-conversion_SOURCE=BUNDLED`.
> >>
> >> Now there's another issue: the CMake configuration fails to find 
> >> flatbuffers,
> >> even though I have flatbuffers 1.7.1 installed from Anaconda.
> >>
> >>
> >> CMake Error at cmake_modules/ThirdpartyToolchain.cmake:152 (find_package):
> >>   By not providing "FindFlatbuffers.cmake" in CMAKE_MODULE_PATH this 
> >> project
> >>   has asked CMake to find a package configuration file provided by
> >>   "Flatbuffers", but CMake did not find one.
> >>
> >>   Could not find a package configuration file provided by "Flatbuffers" 
> >> with
> >>   any of the following names:
> >>
> >> FlatbuffersConfig.cmake
> >> flatbuffers-config.cmake
> >>
> >>   Add the installation prefix of "Flatbuffers" to CMAKE_PREFIX_PATH or set
> >>   "Flatbuffers_DIR" to a directory containing one of the above files.  If
> >>   "Flatbuffers" provides a separate development package or SDK, be sure it
> >>   has been installed.
> >> Call Stack (most recent call first):
> >>   cmake_modules/ThirdpartyToolchain.cmake:1485 (resolve_dependency)
> >>   CMakeLists.txt:544 (include)
> >>
> >>
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 18/03/2019 à 13:51, Antoine Pitrou a écrit :
> >>>
> >>> Ok, so I have a problem.  I had the following line:
> >>>
> >>>   export DOUBLE_CONVERSION_HOME=
> >>>
> >>> which was used to force double-conversion to be built from source
> >>> despite other dependencies being taken from the Conda environment.  Now
> >>> it doesn't work anymore, and I haven't found how to emulate it.
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>> Le 15/03/2019 à 15:38, Uwe L. Korn a écrit :
> >>>> Hello fellow Arrow Devs,
> >>>>
> >>>> we have merged the CMake refactor yesterday 
> >>>> https://github.com/apache/arrow/pull/3688 and this means that the build 
> >>>> system behaves a bit differently. The main differences are:
> >>>>
> >>>> * If you're in a conda environment, we automatically detect this using 
> >>>> the environment variable $CONDA_PREFIX and expect that all dependencies 
> >>>> (except jemalloc and ORC) are installed via conda.
> >>>> * Otherwise, we will look in the standard system paths for a dependency. 
> >>>> If it isn't found, we use CMake's ExternalProject mechanism to build it.
> >>>> * The *_HOME variables are no longer used and are replaced by *_ROOT 
> >>>> variables to use CMake's standard detection features. Be aware that 
> >>>> dependencies are no longer written in all caps but their preferred 
> >>>> casing as seen in 
> >>>> https://github.com/apache/arrow/blob/0d302125abb4b514dba210f496c574a77ce4cd1d/cpp/cmake_modules/ThirdpartyToolchain.cmake#L41-L59
> >>>> * You can manually select the way we detect dependencies via 
> >>>> ARROW_DEPENDENCY_SOURCE 
> >>>> https://github.com/apache/arrow/blob/0d302125abb4b514dba210f496c574a77ce4cd1d/cpp/CMakeLists.txt#L189-L207
> >>>>  The hope is that you as a developer should not normally need to change 
> >>>> this and as packager for distributions, you can use 
> >>>> `ARROW_DEPENDENCY_SOURCE=SYSTEM` to ensure that ExternalProject is not 
> >>>> used but only packages from the package manager. If your system is in a 
> >>>> non-default prefix, you can indicate this by setting 
> >>>> ARROW_PACKAGE_PREFIX.
> >>>>
> >>>> Also, please clear your existing CMake directories and do a fresh build 
> >>>> to avoid any problems. Likewise, when you're using conda packages, please 
> >>>> update them all using `conda update --all`, as I fix errors in the 
> >>>> packaging directly on conda-forge instead of doing workarounds in our 
> >>>> CMake code. A helpful piece of information: conda-forge now provides 
> >>>> a `compilers` package that provides the whole build toolchain.
> >>>>
> >>>> Uwe
> >>>>
> >>
>

