Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-14 Thread Micah Kornfield
Hi Wes,
Yes, I'm making progress, and at this point I anticipate being able to
finish it off by the next release, possibly without support for round-tripping
fixed size lists.  I've been spending some time thinking about different
approaches and have started coding some of the building blocks, which I
think in the common case (relatively low nesting levels) should be fairly
performant (I'm also going to write some benchmarks to sanity-check
this).  One caveat is that my schedule is going to change slightly next
week and it's possible my bandwidth will be more limited; I'll update the
list if this happens.

I think there are at least two areas that I'm not working on that could be
parallelized if you or your team has bandwidth.

1. It would be good to have some parquet files representing real-world
datasets available to benchmark against.
2. The higher-level bookkeeping of tracking which def-levels/rep-levels need
to be compared against for any particular column (i.e. the preceding repeated
parent).  I'm currently working on the code that takes these and converts
them to offsets/null fields (a toy illustration of that conversion is
sketched below).
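
To make item 2 concrete, here is a toy Python sketch (purely illustrative,
not the actual C++ implementation) of what converting def/rep levels into
offsets/null fields means, assuming the usual three-level Parquet LIST
encoding for an optional list of optional int32 (def 0 = null list, 1 = empty
list, 2 = null element, 3 = non-null element; rep 0 starts a new list):

def levels_to_arrow(def_levels, rep_levels, values):
    offsets, list_valid, elements = [0], [], []
    vi = 0  # cursor into the decoded non-null values
    for d, r in zip(def_levels, rep_levels):
        if r == 0:                     # rep level 0: a new top-level list starts
            offsets.append(offsets[-1])
            list_valid.append(d >= 1)  # def 0 means the list itself is null
        if d >= 2:                     # an element slot exists in this list
            offsets[-1] += 1
            elements.append(values[vi] if d == 3 else None)
            if d == 3:
                vi += 1
    return offsets, list_valid, elements

# [[1, 2], None, [], [None]] round-trips as:
# levels_to_arrow([3, 3, 0, 1, 2], [0, 1, 0, 0, 0], [1, 2])
# -> ([0, 2, 2, 2, 3], [True, False, True, True], [1, 2, None])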

I can go into more details if you or your team would like to collaborate.

Thanks,
Micah

On Tue, Apr 14, 2020 at 7:48 AM Wes McKinney  wrote:

> hi Micah,
>
> I'm glad that we have the write side of nested data completed for 0.17.0.
>
> As far as completing the read side and then implementing sufficient
> testing to exercise corner cases in end-to-end reads/writes, do you
> anticipate being able to work on this in the next 4-6 weeks (obviously
> the state of the world has affected everyone's availability /
> bandwidth)? I ask because someone from my team (or me also) may be
> able to get involved and help this move along. It'd be great to have
> this 100% completed and checked off our list for the next release
> (i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration
> tests get completed also)
>
> thanks
> Wes
>
> On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield 
> wrote:
> >>
> >> Glad to hear about the progress. As I mentioned on #2, what do you
> >> think about setting up a feature branch for you to merge PRs into?
> >> Then the branch can be iterated on and we can merge it back when it's
> >> feature complete and does not have perf regressions for the flat
> >> read/write path.
> >>
> I'd like to avoid a separate branch if possible.  I'm willing to close
> the open PR till I'm sure it is needed, but I'm hoping keeping PRs as small
> and focused as possible, with performance testing along the way, will be a
> better reviewer and developer experience here.
> >
> >> The earliest I'd have time to work on this myself would likely be
> >> sometime in March. Others are welcome to jump in as well (and it'd be
> >> great to increase the overall level of knowledge of the Parquet
> >> codebase)
> >
> > Hopefully, Igor can help out; otherwise I'll take up the read path after
> I finish the write path.
> >
> > -Micah
> >
> > On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney  wrote:
> >>
> >> hi Micah
> >>
> >> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield 
> wrote:
> >> >
> >> > Just to give an update.  I've been a little bit delayed, but my
> >> > progress is as follows:
> >> > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
> >> > 2.  Have another PR open that allows a configuration option in C++ to
> >> > determine which algorithm version to use for reading/writing: the
> >> > existing version and the new version supporting complex nested arrays.
> >> > I think a large amount of code will be reused/delegated to, but I will
> >> > err on the side of not touching the existing code/algorithms so that
> >> > any errors in the implementation or performance regressions can
> >> > hopefully be mitigated at runtime.  I expect in later releases (once
> >> > the code has "baked") this option will become a no-op.
> >>
> >> Glad to hear about the progress. As I mentioned on #2, what do you
> >> think about setting up a feature branch for you to merge PRs into?
> >> Then the branch can be iterated on and we can merge it back when it's
> >> feature complete and does not have perf regressions for the flat
> >> read/write path.
> >>
> >> > 3.  Started coding the write path.
> >> >
> >> > Which leaves:
> >> > 1.  Finishing the write path (I estimate 2-3 weeks) to be code complete
> >> > 2.  Implementing the read path.
> >>
> >> The earliest I'd have time to work on this myself would likely be
> >> sometime in March. Others are welcome to jump in as well (and it'd be
> >> great to increase the overall level of knowledge of the Parquet
> >> codebase)
> >>
> >> > Again, I'm happy to collaborate if people have bandwidth and want to
> >> > contribute.
> >> >
> >> > Thanks,
> >> > Micah
> >> >
> >> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi Wes,
> >> > > Hi Wes,
> >> > > I'm still interested in doing the work.  But don't let me hold
> >> > > anybody up if they have bandwidth.
> >> > >
> 

[jira] [Created] (ARROW-8466) [Packaging] The python unittests are not running in the windows wheel builds

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8466:
--

 Summary: [Packaging] The python unittests are not running in the 
windows wheel builds
 Key: ARROW-8466
 URL: https://issues.apache.org/jira/browse/ARROW-8466
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Krisztian Szucs


Appveyor's log swallows the reason why those tests are not running. Requires investigation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8465) [Packaging][Python] Windows py35 wheel build fails because of boost

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8465:
--

 Summary: [Packaging][Python] Windows py35 wheel build fails 
because of boost
 Key: ARROW-8465
 URL: https://issues.apache.org/jira/browse/ARROW-8465
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.17.0


See build log 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-appveyor-wheel-win-cp35m



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-04-14-3

2020-04-14 Thread Crossbow


Arrow Build Report for Job nightly-2020-04-14-3

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3

Failed Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-centos-6-amd64
- homebrew-cpp-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-homebrew-cpp-autobrew
- test-conda-cpp-hiveserver2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-cpp-hiveserver2
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-hdfs-2.9.2
- ubuntu-focal-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-ubuntu-focal-amd64
- ubuntu-xenial-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-ubuntu-xenial-amd64
- wheel-manylinux2014-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-wheel-manylinux2014-cp36m
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-appveyor-wheel-win-cp35m

Pending Tasks:
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-spark-master

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-centos-8-amd64
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-debian-buster-amd64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-github-debian-stretch-amd64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-gandiva-jar-osx
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-gandiva-jar-xenial
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-homebrew-cpp
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-pandas-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-3-circle-test-conda-python-3.7-pandas-latest
- test-conda-python-3.7-pandas-master:
  URL: 

[jira] [Created] (ARROW-8464) [Rust] [DataFusion] Add support for dictionary types

2020-04-14 Thread Andy Grove (Jira)
Andy Grove created ARROW-8464:
-

 Summary: [Rust] [DataFusion] Add support for dictionary types
 Key: ARROW-8464
 URL: https://issues.apache.org/jira/browse/ARROW-8464
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Andy Grove


 
 * BatchIterator should accept both DictionaryBatch and RecordBatch
 * Type Coercion optimizer rule should inject an expression for converting 
dictionary value types to index types (for equality expressions, and 
IN(values, ...))
 * Physical expression would look up the index for dictionary values 
referenced in the query so that at runtime, only indices are compared per 
batch (see the sketch below)
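
A rough pyarrow illustration of the intended runtime behavior (DataFusion is
Rust; this Python sketch only demonstrates the index-comparison idea):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

# Resolve the literal to its dictionary index once, then compare only the
# integer indices per batch.
arr = pa.array(["red", "blue", "red", "green"]).dictionary_encode()
idx = arr.dictionary.to_pylist().index("red")  # one-time lookup -> 0
mask = pc.equal(arr.indices, idx)              # per-batch integer comparison
print(mask)  # true, false, true, false
{code}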



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8463) [CI] Balance the nightly test builds between CircleCI, Azure and Github

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8463:
--

 Summary: [CI] Balance the nightly test builds between CircleCI, 
Azure and Github
 Key: ARROW-8463
 URL: https://issues.apache.org/jira/browse/ARROW-8463
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Krisztian Szucs


Most of our nightly docker builds are running on CircleCI and it's queuing, so 
try to offload some of the builds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows

2020-04-14 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-8462:
-

 Summary: Crash in lib.concat_tables on Windows
 Key: ARROW-8462
 URL: https://issues.apache.org/jira/browse/ARROW-8462
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Tom Augspurger


This crashes for me with pyarrow 0.16 on my Windows VM


{{
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])

print('done')
}}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug 
info on Windows, unfortunately. With `python -X faulthandler` I see

{{
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in (module)
}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Follow up on ARROW-8451, datafusion part of Arrow

2020-04-14 Thread Wes McKinney
hi Remi,

It's no problem, it's a common question we get. Some developers
believe as a matter of principle that large projects should be broken
up into many smaller repositories.

Arrow is different from many open source projects. Maintaining
protocol-level interoperability (although note that Rust does not yet
participate in the integration tests) has been a great deal of effort,
and the community has felt that trying to coordinate changes that
impact interoperability is substantially simpler in a monorepo
arrangement on GitHub. That way we always know with relative certainty
whether any pull request may break interoperability between one
component and another. It's very easy to get into a situation where
you have a mess of cross-repository (or even circular) build and
runtime dependencies -- the monorepo makes all of this pain go away.
If you have a change that affects multiple repositories, CI tools
don't make it easy to test those PRs together; generally you'll just
see that a PR on one repo is breaking against the master of the other
repository.

In some cases, components may not have integrations with other
languages but that may not always be the case in the future. We have
just developed the C interface, for example, which would enable
DataFusion to be built as a shared library and imported in Python (if
someone wanted to do that).

Another dimension is that all of the PLs and components have benefited
greatly from the community's investment in CI and packaging
infrastructure.

I also believe that the project's common PR queue helps create a sense
of community awareness and solidarity amongst project contributors.
If Rust were working off in their own corner of GitHub, I think it
would be easy for people who are not working on Rust to ignore them. I
think the net result of the way that we currently operate is that
we're producing higher quality software and have a healthier community
than we would otherwise with a more fragmented approach.

Lastly, the shared release cycle creates social pressure to get
patches finished and merged. Anecdotally this seems to be effective.

On the governance questions, see the roles section on
https://www.apache.org/foundation/how-it-works.html#roles

If a part of apache/arrow truly believed that they were being hindered
by being a part of the monorepo, we could create a new repository under
apache/ on GitHub for the part that wants to split into a standalone
GitHub repository. That wouldn't change the governance of that code.

- Wes

On Tue, Apr 14, 2020 at 1:26 PM Rémi Dettai  wrote:
>
> This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451.
>
> First thanks for your answer!
>
> It's true that I was also surprised to see all implementations of Arrow
> mixed up in a single repository!
>
> I was really considering the separation of the repositories as a means to
> separate concerns. I am not 100% sure I understand how it would fragment
> the community but I think I get the point, even though I still believe that
> it is at the cost of extra complexity.
>
> As for the legal protection, I did not take that aspect into consideration,
> and I find it very interesting! What is the PMC exactly and why would
> Datafusion be more exposed in a separate repository?


[jira] [Created] (ARROW-8461) [Packaging][deb] Use zstd package for Ubuntu Xenial

2020-04-14 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8461:
---

 Summary: [Packaging][deb] Use zstd package for Ubuntu Xenial
 Key: ARROW-8461
 URL: https://issues.apache.org/jira/browse/ARROW-8461
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8460) [Packaging][deb] Ubuntu Focal build fails

2020-04-14 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-8460:
---

 Summary: [Packaging][deb] Ubuntu Focal build fails
 Key: ARROW-8460
 URL: https://issues.apache.org/jira/browse/ARROW-8460
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.17.0


It seems that this is a "no disk space" error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8459) [Dev][Archery] Use a more recent cmake-format

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8459:
--

 Summary: [Dev][Archery] Use a more recent cmake-format
 Key: ARROW-8459
 URL: https://issues.apache.org/jira/browse/ARROW-8459
 Project: Apache Arrow
  Issue Type: Task
  Components: Developer Tools
Reporter: Krisztian Szucs
 Fix For: 1.0.0


Reading through the cmake-format releases page, newer versions seem to contain 
relevant improvements.

Additionally we should check cmake-format's version in run-cmake-format.py to 
have unified behaviour both locally and on the CI (a sketch of such a check 
follows below).
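
A minimal sketch of such a guard (the pinned version string below is a 
placeholder, not the real pin):

{code:python}
import subprocess
import sys

EXPECTED_VERSION = "0.5.2"  # placeholder pin, not the real one

# cmake-format prints its version with --version; refuse to run on a mismatch
found = subprocess.run(["cmake-format", "--version"],
                       capture_output=True, text=True).stdout.strip()
if found != EXPECTED_VERSION:
    sys.exit("cmake-format {} is required, found {}".format(
        EXPECTED_VERSION, found))
{code}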



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8458) [C++] Prefer the original mirrors for the bundled thirdparty dependencies

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8458:
--

 Summary: [C++] Prefer the original mirrors for the bundled 
thirdparty dependencies
 Key: ARROW-8458
 URL: https://issues.apache.org/jira/browse/ARROW-8458
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8457) [C++] bridge test does not take care of endianness

2020-04-14 Thread Kazuaki Ishizaki (Jira)
Kazuaki Ishizaki created ARROW-8457:
---

 Summary: [C++] bridge test does not take care of endianness
 Key: ARROW-8457
 URL: https://issues.apache.org/jira/browse/ARROW-8457
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kazuaki Ishizaki


According to the 
[specification|https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst]
 of ArrowSchema, the memory format uses the native endianness of the CPU. 
However, the test cases assume little endian only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Follow up on ARROW-8451, datafusion part of Arrow

2020-04-14 Thread Rémi Dettai
This is a follow up on https://issues.apache.org/jira/browse/ARROW-8451.

First thanks for your answer!

It's true that I was also surprised to see all implementations of Arrow
mixed up in a single repository!

I was really considering the separation of the repositories as a mean to
separate concerns. I am not 100% sure to understand how it would fragment
the community but I think I get the point, even though I still believe that
it is at the cost of extra complexity.

As for the legal protection, I did not take that aspect into consideration,
and I find it very interesting! What is the PMC exactly and why would
Datafusion be more exposed in a separate repository?


[jira] [Created] (ARROW-8456) [Release] Add python script to help curating JIRA

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8456:
--

 Summary: [Release] Add python script to help curating JIRA
 Key: ARROW-8456
 URL: https://issues.apache.org/jira/browse/ARROW-8456
 Project: Apache Arrow
  Issue Type: Task
  Components: Developer Tools
Reporter: Krisztian Szucs
 Fix For: 1.0.0


The following script produces reports like 
https://gist.github.com/kszucs/9857ef69c92a230ce5a5068551b83ed8

{code:python}
from jira import JIRA
import re
import warnings
import pygit2
import pandas as pd
from io import StringIO


class Patch:

    def __init__(self, commit):
        self.commit = commit
        self.issue_key, self.msg = self._parse(commit.message)

    def _parse(self, message):
        first_line = message.splitlines()[0]

        m = re.match(r"(?P<ticket>((ARROW|PARQUET)\-\d+)):?(?P<msg>.*)",
                     first_line)
        if m is None:
            return None, ''

        values = m.groupdict()
        return values['ticket'], values['msg']

    @property
    def shortmessage(self):
        if not self.msg:
            return self.commit.message.splitlines()[0]
        else:
            return self.msg

    @property
    def sha(self):
        return self.commit.id

    @property
    def issue_url(self):
        return 'https://issues.apache.org/jira/browse/{}'.format(self.issue_key)

    @property
    def commit_url(self):
        return 'https://github.com/apache/arrow/commit/{}'.format(self.sha)

    def to_markdown(self):
        if self.issue_key is None:
            return "[{}]({})\n".format(
                self.shortmessage,
                self.commit_url
            )
        else:
            return "[{}]({}): [{}]({})\n".format(
                self.issue_key,
                self.issue_url,
                self.shortmessage,
                self.commit_url
            )


JIRA_SEARCH_LIMIT = 1
# JIRA_SEARCH_LIMIT = 50


class Release:
    """Release object for querying issues and commits

    Usage:
        jira = JIRA(
            {'server': 'https://issues.apache.org/jira'},
            basic_auth=(user, password)
        )
        repo = pygit2.Repository('path/to/arrow/repo')

        release = Release(jira, repo, '0.15.1', '0.15.0')
        # show the commits in application order
        for commit in release.commits():
            print(commit.oid)
        # cherry-pick the patches to a branch
        release.apply_patches_to('a-branch')
    """

    def __init__(self, jira, repo, version, previous_version):
        self.jira = jira
        self.repo = repo
        self.version = version
        self.previous_version = previous_version
        self._issues = None
        self._patches = None

    def _tag(self, version):
        return self.repo.revparse_single(f'refs/tags/apache-arrow-{version}')

    def issues(self):
        # FIXME(kszucs): paginate instead of maxresults
        if self._issues is None:
            query = f'project=ARROW AND fixVersion={self.version}'
            self._issues = self.jira.search_issues(query,
                                                   maxResults=JIRA_SEARCH_LIMIT)
        return self._issues

    def patches(self):
        """Commits belonging to release applied on master branch

        The returned commits' order corresponds to the output of
        git log.
        """
        if self._patches is None:
            previous_tag = self._tag(self.previous_version)
            master = self.repo.branches['master']
            ordering = pygit2.GIT_SORT_TOPOLOGICAL | pygit2.GIT_SORT_REVERSE
            walker = self.repo.walk(master.target, ordering)
            walker.hide(previous_tag.oid)
            self._patches = list(map(Patch, walker))

        return self._patches

    def curate(self):
        issues = self.issues()
        patches = self.patches()
        issue_keys = {issue.key for issue in issues}

        within, outside, nojira = [], [], []
        for p in patches:
            if p.issue_key is None:
                nojira.append(p)
            elif p.issue_key in issue_keys:
                within.append(p)
                issue_keys.remove(p.issue_key)
            else:
                outside.append(p)

        # remaining jira tickets
        nopatch = list(issue_keys)

        return within, outside, nojira, nopatch

    def curation_report(self):
        out = StringIO()

        out.write('Total number of JIRA tickets assigned to version {}: {}\n'
                  .format(self.version, len(self.issues())))
        out.write('\n')
        out.write('Total number of applied patches since {}: {}\n'
                  .format(self.previous_version, len(self.patches())))

        out.write('\n\n')

        within, outside, nojira, nopatch = self.curate()

        out.write('Patches with assigned

[jira] [Created] (ARROW-8455) [Rust] Parquet Arrow column read on partially compatible files

2020-04-14 Thread Remi Dettai (Jira)
Remi Dettai created ARROW-8455:
--

 Summary: [Rust] Parquet Arrow column read on partially compatible 
files
 Key: ARROW-8455
 URL: https://issues.apache.org/jira/browse/ARROW-8455
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Affects Versions: 0.15.1
Reporter: Remi Dettai


Seen behavior: When reading a Parquet file into Arrow with 
`get_record_reader_by_columns`, it will fail if one of the columns of the file 
is a list (or any other unsupported type).

Expected behavior: it should only fail if you are actually reading a column 
with an unsupported type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8454) [CI] Add 3rdparty Apache dependency tarballs to github

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8454:
--

 Summary: [CI] Add 3rdparty Apache dependency tarballs to github
 Key: ARROW-8454
 URL: https://issues.apache.org/jira/browse/ARROW-8454
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Krisztian Szucs


Follow-up on https://github.com/apache/arrow/pull/6922#issuecomment-613527789



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-04-14-2

2020-04-14 Thread Crossbow


Arrow Build Report for Job nightly-2020-04-14-2

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2

Failed Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-centos-6-amd64
- gandiva-jar-osx:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-gandiva-jar-osx
- homebrew-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-homebrew-cpp
- test-conda-cpp-hiveserver2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-cpp-hiveserver2
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-turbodbc-master
- ubuntu-focal-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-ubuntu-focal-amd64
- ubuntu-xenial-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-ubuntu-xenial-amd64
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-wheel-osx-cp36m
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-appveyor-wheel-win-cp35m

Pending Tasks:
- test-debian-ruby:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-debian-ruby
- test-fedora-30-python-3:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-fedora-30-python-3

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-centos-8-amd64
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-linux-gcc-py36
- conda-linux-gcc-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-linux-gcc-py37
- conda-linux-gcc-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-linux-gcc-py38
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-osx-clang-py36
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-osx-clang-py37
- conda-osx-clang-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-osx-clang-py38
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-win-vs2015-py36
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-win-vs2015-py37
- conda-win-vs2015-py38:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-azure-conda-win-vs2015-py38
- debian-buster-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-debian-buster-amd64
- debian-stretch-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-github-debian-stretch-amd64
- gandiva-jar-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-gandiva-jar-xenial
- homebrew-cpp-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-homebrew-cpp-autobrew
- homebrew-r-autobrew:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-travis-homebrew-r-autobrew
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-2-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 

[jira] [Created] (ARROW-8453) [Integration][Go] Recursive nested types unsupported

2020-04-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8453:
-

 Summary: [Integration][Go] Recursive nested types unsupported
 Key: ARROW-8453
 URL: https://issues.apache.org/jira/browse/ARROW-8453
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Integration
Reporter: Antoine Pitrou


The Go JSON integration implementation doesn't support recursive nested types, 
e.g. "list(list(int32))".

Here is an example traceback when Go is the consumer:
{code}
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc1687c, 
0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x1710
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc16858, 
0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x838
github.com/apache/arrow/go/arrow/internal/arrjson.fieldFromJSON(0xc16860, 
0xb, 0xc16858, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/arrow/go/arrow/internal/arrjson/arrjson.go:309 +0xb5
github.com/apache/arrow/go/arrow/internal/arrjson.fieldsFromJSON(0xcca280, 
0x4, 0x4, 0x0, 0x6f6d08, 0xc0db60)
/arrow/go/arrow/internal/arrjson/arrjson.go:301 +0xfe
github.com/apache/arrow/go/arrow/internal/arrjson.schemaFromJSON(0xcca280, 
0x4, 0x4, 0xc0db60)
/arrow/go/arrow/internal/arrjson/arrjson.go:274 +0x3f
github.com/apache/arrow/go/arrow/internal/arrjson.NewReader(0x5b4700, 
0xc0e028, 0x0, 0x0, 0x0, 0x0, 0x0, 0xd0)
/arrow/go/arrow/internal/arrjson/reader.go:56 +0x13d
main.validate(0x7ffbc819, 0x37, 0x7ffbc857, 0x26, 0x4acf01, 0x0, 0x0)
/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:181 +0x1c8
main.runCommand(0x7ffbc857, 0x26, 0x7ffbc819, 0x37, 0x7ffbc884, 
0x8, 0xc16101, 0xc86260, 0x40568f)
/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:65 +0x228
main.main()
/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:44 +0x24e
{code}

When Go is the producer:
{code}
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc1687c, 
0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x1710
github.com/apache/arrow/go/arrow/internal/arrjson.dtypeFromJSON(0xc1686c, 
0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/arrow/go/arrow/internal/arrjson/arrjson.go:238 +0x838
github.com/apache/arrow/go/arrow/internal/arrjson.fieldFromJSON(0xc16860, 
0xb, 0xc1686c, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/arrow/go/arrow/internal/arrjson/arrjson.go:309 +0xb5
github.com/apache/arrow/go/arrow/internal/arrjson.fieldsFromJSON(0xcca280, 
0x4, 0x4, 0x0, 0x6f6d08, 0xc0db60)
/arrow/go/arrow/internal/arrjson/arrjson.go:301 +0xfe
github.com/apache/arrow/go/arrow/internal/arrjson.schemaFromJSON(0xcca280, 
0x4, 0x4, 0xc0db60)
/arrow/go/arrow/internal/arrjson/arrjson.go:274 +0x3f
github.com/apache/arrow/go/arrow/internal/arrjson.NewReader(0x5b4700, 
0xc0e028, 0x0, 0x0, 0x0, 0x0, 0x0, 0xcc37a1760fc5b719)
/arrow/go/arrow/internal/arrjson/reader.go:56 +0x13d
main.cnvToARROW(0x7ffbc814, 0x37, 0x7ffbc852, 0x26, 0x4acf01, 0x0, 0x0)
/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:137 +0x319
main.runCommand(0x7ffbc852, 0x26, 0x7ffbc814, 0x37, 0x7ffbc87f, 
0xd, 0xc16101, 0xc86260, 0x40568f)
/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:63 +0x172
main.main()
/arrow/go/arrow/ipc/cmd/arrow-json-integration-test/main.go:44 +0x24e
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8452) [Go][Integration] Go JSON producer generates incorrect nullable flag for nested types

2020-04-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8452:
-

 Summary: [Go][Integration] Go JSON producer generates incorrect 
nullable flag for nested types
 Key: ARROW-8452
 URL: https://issues.apache.org/jira/browse/ARROW-8452
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Integration
Reporter: Antoine Pitrou


It seems that when generating JSON integration data for a nested type, e.g.
"list(int32)", the list's nullable flag is also inherited by child fields. This 
is wrong, because child fields have independent nullable flags, e.g. you may 
have any of the following (see the sketch after this list):
* "list(field("ints", int32, nullable=True), nullable=True)"
* "list(field("ints", int32, nullable=False), nullable=True)"
* "list(field("ints", int32, nullable=True), nullable=False)"
* "list(field("ints", int32, nullable=False), nullable=False)"




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8451) [Rust] [Datafusion]

2020-04-14 Thread Remi Dettai (Jira)
Remi Dettai created ARROW-8451:
--

 Summary: [Rust] [Datafusion] 
 Key: ARROW-8451
 URL: https://issues.apache.org/jira/browse/ARROW-8451
 Project: Apache Arrow
  Issue Type: Wish
  Components: Rust - DataFusion
Reporter: Remi Dettai


Datafusion is a great example of how to use Arrow. But having Datafusion inside 
the Arrow project has several drawbacks:
 * longer build times (Rust builds are already slow)
 * more frequent updates (creates noise)
 * its roadmap can be quite independent of that of Arrow

What is the actual benefit of having Datafusion inside the Arrow repo?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8450) [Integration][C++] Implement large list/binary/utf8 integration

2020-04-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8450:
-

 Summary: [Integration][C++] Implement large list/binary/utf8 
integration
 Key: ARROW-8450
 URL: https://issues.apache.org/jira/browse/ARROW-8450
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Integration
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8449) [R] Use CMAKE_UNITY_BUILD everywhere

2020-04-14 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-8449:
--

 Summary: [R] Use CMAKE_UNITY_BUILD everywhere
 Key: ARROW-8449
 URL: https://issues.apache.org/jira/browse/ARROW-8449
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging, R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8448) [Package] Can't build apt packages with ubuntu-focal

2020-04-14 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8448:
-

 Summary: [Package] Can't build apt packages with ubuntu-focal
 Key: ARROW-8448
 URL: https://issues.apache.org/jira/browse/ARROW-8448
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Francois Saint-Jacques
Assignee: Kouhei Sutou


While trying to debug the failing nightly (due to disk space), I encountered 
the following error: the tar generated by the build script does not conform to 
what debuilder expects. It blocks the build.
{code}
Successfully built ecdda7ea015d
Successfully tagged apache-arrow-ubuntu-focal:latest
docker run --rm --tty --volume 
/home/fsaintjacques/src/db/arrow/dev/tasks/linux-packages/apache-arrow/apt:/host:rw
 --env DEBUG=yes apache-arrow-ubuntu-focal /host/build.sh
This package has a Debian revision number but there does not seem to be
an appropriate original tar file or .orig directory in the parent directory;
(expected one of apache-arrow_0.16.0.orig.tar.gz, 
apache-arrow_0.16.0.orig.tar.bz2,
apache-arrow_0.16.0.orig.tar.lzma,  apache-arrow_0.16.0.orig.tar.xz or 
apache-arrow-1.0.0~dev20200414.orig)
continue anyway? (y/n) 

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

2020-04-14 Thread Wes McKinney
hi Micah,

I'm glad that we have the write side of nested data completed for 0.17.0.

As far as completing the read side and then implementing sufficient
testing to exercise corner cases in end-to-end reads/writes, do you
anticipate being able to work on this in the next 4-6 weeks (obviously
the state of the world has affected everyone's availability /
bandwidth)? I ask because someone from my team (or me also) may be
able to get involved and help this move along. It'd be great to have
this 100% completed and checked off our list for the next release
(i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration
tests get completed also)

thanks
Wes

On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield  wrote:
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
> I'd like to avoid a separate branch if possible.  I'm willing to close the 
> open PR till I'm sure it is needed, but I'm hoping keeping PRs as small and 
> focused as possible with performance testing along the way will be a better 
> reviewer and developer experience here.
>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>
> Hopefully, Igor can help out otherwise I'll take up the read path after I 
> finish the write path.
>
> -Micah
>
> On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney  wrote:
>>
>> hi Micah
>>
>> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield  
>> wrote:
>> >
>> > Just to give an update.  I've been a little bit delayed, but my progress is
>> > as follows:
>> > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
>> > 2.  Have another PR open that allows a configuration option in C++ to
>> > determine which algorithm version to use for reading/writing: the existing
>> > version and the new version supporting complex nested arrays.  I think a
>> > large amount of code will be reused/delegated to, but I will err on the side
>> > of not touching the existing code/algorithms so that any errors in the
>> > implementation or performance regressions can hopefully be mitigated at
>> > runtime.  I expect in later releases (once the code has "baked") this
>> > option will become a no-op.
>>
>> Glad to hear about the progress. As I mentioned on #2, what do you
>> think about setting up a feature branch for you to merge PRs into?
>> Then the branch can be iterated on and we can merge it back when it's
>> feature complete and does not have perf regressions for the flat
>> read/write path.
>>
>> > 3.  Started coding the write path.
>> >
>> > Which leaves:
>> > 1.  Finishing the write path (I estimate 2-3 weeks) to be code complete
>> > 2.  Implementing the read path.
>>
>> The earliest I'd have time to work on this myself would likely be
>> sometime in March. Others are welcome to jump in as well (and it'd be
>> great to increase the overall level of knowledge of the Parquet
>> codebase)
>>
>> > Again, I'm happy to collaborate if people have bandwidth and want to
>> > contribute.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield 
>> > wrote:
>> >
>> > > Hi Wes,
>> > > I'm still interested in doing the work.  But don't let me hold anybody up if
>> > > they have bandwidth.
>> > >
>> > > In order to actually make progress on this, my plan will be to:
>> > > 1.  Help with the current Java review backlog through early next week or
>> > > so (this has been taking the majority of my time allocated for Arrow
>> > > contributions for the last 6 months or so).
>> > > 2.  Shift all my attention to trying to get this done (this means no
>> > > reviews other than closing out existing ones that I've started until it
>> > > is done).  Hopefully, other Java committers can help shrink the backlog
>> > > further (Jacques, thanks for your recent efforts here).
>> > >
>> > > Thanks,
>> > > Micah
>> > >
>> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney  wrote:
>> > >
>> > >> hi folks,
>> > >>
>> > >> I think we have reached a point where the incomplete C++ Parquet
>> > >> nested data assembly/disassembly is harming the value of several
>> > >> other parts of the project, for example the Datasets API. As another
>> > >> example, it's possible to ingest nested data from JSON but not write
>> > >> it to Parquet in general.
>> > >>
>> > >> Implementing the nested data read and write path completely is a
>> > >> difficult project requiring at least several weeks of dedicated work,
>> > >> so it's not so surprising that it hasn't been accomplished yet. I know
>> > >> that several people have expressed interest in working on it, but I
>> > >> would like to see if anyone would be able to volunteer a commitment of
>> > >> 

[jira] [Created] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering

2020-04-14 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-8447:
-

 Summary: [C++][Dataset] Ensure Scanner::ToTable preserve ordering
 Key: ARROW-8447
 URL: https://issues.apache.org/jira/browse/ARROW-8447
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


This can be refactored with a little effort in Scanner::ToTable:

# Change `batches` to `std::vector`
# When pushing the closure to the TaskGroup, also track an incrementing 
integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector; 
when all batches are consumed, move the local vector into `batches` at the 
right index, resizing and emplacing with a mutex
# After waiting for the task group completion, either
* concatenate into a single vector and call `Table::FromRecordBatch`, or
* write a RecordBatchReader that supports vector and add a 
method `Table::FromRecordBatchReader`

The latter involves more work but is the cleaner way; the other 
FromRecordBatch method can be implemented from it and supports "streaming". 
(A toy sketch of the ordering idea follows below.)
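
A toy Python stand-in for the TaskGroup bookkeeping described above 
(illustrative only; the real change is in C++):

{code:python}
from concurrent.futures import ThreadPoolExecutor

def scan_to_table(scan_tasks):
    # one slot per scan task, keyed by scan_task_id
    slots = [None] * len(scan_tasks)

    def run(task_id, task):
        # consume all batches locally, then move them into the indexed slot
        slots[task_id] = list(task())

    with ThreadPoolExecutor() as pool:
        for task_id, task in enumerate(scan_tasks):
            pool.submit(run, task_id, task)
    # the pool's shutdown waits for completion; flatten in scan-task order,
    # regardless of the order in which tasks finished
    return [batch for batches in slots for batch in batches]
{code}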



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8446) [Python][Dataset] Detect and use _metadata file in a list of file paths

2020-04-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8446:


 Summary: [Python][Dataset] Detect and use _metadata file in a list 
of file paths
 Key: ARROW-8446
 URL: https://issues.apache.org/jira/browse/ARROW-8446
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/dask/dask/pull/6047#discussion_r402391318

When specifying a directory to {{ParquetDataset}}, we will detect if a 
{{_metadata}} file is present in the directory and use that to populate the 
{{metadata}} attribute (and not include this file in the list of "pieces", 
since it does not include any data).
 
However, when passing a list of files to {{ParquetDataset}}, with one being 
"_metadata", the metadata attribute is not populated, and the "_metadata" path 
is included as one of the ParquetDatasetPiece objects instead (which leads to 
an ArrowIOError during the read of that piece).

We _could_ detect it in a list of paths as well (see the sketch after the note 
below).

Note, I mentioned {{ParquetDataset}}, but if working on this, we should 
probably directly do it in the datasets API-based version.  
Also, I labeled this as Python and not C++ for now, as this might be something 
that can be handled on the Python side (once the C++ side knows how to process 
this kind of metadata -> ARROW-8062)
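
A user-side sketch of the proposed detection (hypothetical paths; the real fix 
would live inside the datasets machinery):

{code:python}
import os
import pyarrow.parquet as pq

paths = ["data/part-0.parquet", "data/part-1.parquet", "data/_metadata"]

# split the _metadata sidecar out of the path list before building the dataset
data_paths = [p for p in paths if os.path.basename(p) != "_metadata"]
meta_path = next((p for p in paths if os.path.basename(p) == "_metadata"), None)

dataset = pq.ParquetDataset(data_paths)
metadata = pq.read_metadata(meta_path) if meta_path else None
{code}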



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8445) [Gandiva][UDF] Add a udf for gandiva to extract the first capture in regex.

2020-04-14 Thread ZMZ91 (Jira)
ZMZ91 created ARROW-8445:


 Summary: [Gandiva][UDF] Add a udf for gandiva to extract the first 
capture in regex.
 Key: ARROW-8445
 URL: https://issues.apache.org/jira/browse/ARROW-8445
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, C++ - Gandiva
Reporter: ZMZ91


Add a gandiva udf to extract the first capture in regex: 
[https://github.com/apache/arrow/pull/6925]
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8444) [Documentation] Fix spelling errors across the codebase

2020-04-14 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-8444:
--

 Summary: [Documentation] Fix spelling errors across the codebase
 Key: ARROW-8444
 URL: https://issues.apache.org/jira/browse/ARROW-8444
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Krisztian Szucs






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8443) [Gandiva][C++] Fix round/truncate to no-op for special cases

2020-04-14 Thread Praveen Kumar (Jira)
Praveen Kumar created ARROW-8443:


 Summary: [Gandiva][C++] Fix round/truncate to no-op for special 
cases
 Key: ARROW-8443
 URL: https://issues.apache.org/jira/browse/ARROW-8443
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Gandiva
Affects Versions: 1.0.0
Reporter: Praveen Kumar


For round and truncate, where the target scale is greater than the input 
scale, make the operation a no-op (a pure-Python sketch of the rule follows 
below).
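
A pure-Python sketch of the rule (a stand-in for the Gandiva decimal kernel; 
half-away-from-zero rounding is assumed here, and the function operates on the 
unscaled integer representation):

{code:python}
def round_decimal(unscaled_value, input_scale, target_scale):
    if target_scale >= input_scale:
        return unscaled_value  # no-op: nothing to round away
    drop = 10 ** (input_scale - target_scale)
    # round half away from zero on the unscaled integer representation
    q, r = divmod(abs(unscaled_value), drop)
    q += int(2 * r >= drop)
    return q if unscaled_value >= 0 else -q

assert round_decimal(12345, 2, 1) == 1235   # 123.45 -> 123.5
assert round_decimal(12345, 2, 3) == 12345  # no-op special case
{code}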



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8442) [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy

2020-04-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8442:


 Summary: [Python] NullType.to_pandas_dtype inconsistent with dtype 
returned in to_pandas/to_numpy
 Key: ARROW-8442
 URL: https://issues.apache.org/jira/browse/ARROW-8442
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


There is this behaviour of {{to_pandas_dtype}} returning float, while all 
actual conversions to numpy or pandas use object dtype:

{code}
In [23]: pa.null().to_pandas_dtype()
Out[23]: numpy.float64

In [24]: pa.array([], pa.null()).to_pandas()
Out[24]: Series([], dtype: object)

In [25]: pa.array([], pa.null()).to_numpy(zero_copy_only=False)
Out[25]: array([], dtype=object)
{code}

So we should probably fix {{NullType.to_pandas_dtype}} to return object, which 
is what is used in practice (see the sketch below).
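
A minimal sketch of the proposed change (NullType is simplified here to a 
stand-alone class; the real one lives in pyarrow's type hierarchy):

{code:python}
import numpy as np
import pyarrow as pa

class NullType:
    def to_pandas_dtype(self):
        # object dtype matches what to_pandas()/to_numpy() actually produce
        return np.object_

assert pa.array([], pa.null()).to_pandas().dtype == NullType().to_pandas_dtype()
{code}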



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8441) [C++] Fix crashes on invalid input (OSS-Fuzz)

2020-04-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-8441:
-

 Summary: [C++] Fix crashes on invalid input (OSS-Fuzz)
 Key: ARROW-8441
 URL: https://issues.apache.org/jira/browse/ARROW-8441
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8440) Refine simd header files

2020-04-14 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8440:
---

 Summary: Refine simd header files
 Key: ARROW-8440
 URL: https://issues.apache.org/jira/browse/ARROW-8440
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai


This is a follow-up of ARROW-8227. It aims to unify simd header files and 
simplify code.
Currently, sse header files are included in sse_util.h, neon header files in 
neon_util.h, and avx header files are included directly in C source files. 
sse_util.h/neon_util.h also contain crc code which is not used by the cpp 
files that #include them.
It may be better to put all simd header files in a single simd.h, and move 
crc code to where it is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8439) [Python] Filesystem docs are outdated

2020-04-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8439:


 Summary: [Python] Filesystem docs are outdated
 Key: ARROW-8439
 URL: https://issues.apache.org/jira/browse/ARROW-8439
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2020-04-14-1

2020-04-14 Thread Crossbow


Arrow Build Report for Job nightly-2020-04-14-1

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1

Failed Tasks:
- centos-6-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-github-centos-6-amd64
- ubuntu-focal-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-github-ubuntu-focal-amd64
- ubuntu-xenial-amd64:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-github-ubuntu-xenial-amd64
- wheel-osx-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-travis-wheel-osx-cp36m
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-appveyor-wheel-win-cp35m

Pending Tasks:
- test-conda-cpp-hiveserver2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-cpp-hiveserver2
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-cpp-valgrind
- test-conda-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-cpp
- test-conda-python-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.6
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-2.9.2:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-hdfs-2.9.2
- test-conda-python-3.7-kartothek-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-kartothek-latest
- test-conda-python-3.7-kartothek-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-kartothek-master
- test-conda-python-3.7-spark-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-spark-master
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.8-jpype
- test-conda-python-3.8:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-python-3.8
- test-conda-r-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-conda-r-3.6
- test-debian-10-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-10-cpp
- test-debian-10-go-1.12:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-10-go-1.12
- test-debian-10-python-3:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-10-python-3
- test-debian-c-glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-c-glib
- test-debian-ruby:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-debian-ruby
- test-fedora-30-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-fedora-30-cpp
- test-ubuntu-16.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-16.04-cpp
- test-ubuntu-18.04-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-cpp-release
- test-ubuntu-18.04-cpp:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-cpp
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-docs
- test-ubuntu-18.04-r-3.6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-18.04-r-3.6
- test-ubuntu-c-glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-c-glib
- test-ubuntu-ruby:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-circle-test-ubuntu-ruby
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-04-14-1-appveyor-wheel-win-cp36m

Succeeded Tasks:
- centos-7-amd64:
  URL: 

[jira] [Created] (ARROW-8438) [C++] arrow-io-memory-benchmark crashes

2020-04-14 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-8438:
---

 Summary: [C++] arrow-io-memory-benchmark crashes
 Key: ARROW-8438
 URL: https://issues.apache.org/jira/browse/ARROW-8438
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yibo Cai


"arrow-io-memory-benchmark" SIGSEGV in latest code base. It worked at least 
when my last commit 8 days ago: b1d4c86eb28267525c52f436c3a096e70b8ef6e0

stack backtrace attached

(gdb) r
Starting program: /home/cyb/share/debug/arrow-io-memory-benchmark 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
(gdb) [New Thread 0x737ff700 (LWP 29065)]
2020-04-14 14:24:40
Running /home/cyb/share/debug/arrow-io-memory-benchmark
Run on (32 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x16)
  L1 Instruction 64K (x16)
  L2 Unified 512K (x16)
  L3 Unified 4096K (x16)
Load Average: 2.64, 4.39, 4.28
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may 
be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.

Thread 1 "arrow-io-memory" received signal SIGSEGV, Segmentation fault.
0x768e67c8 in arrow::Buffer::is_mutable (this=0x0) at 
../src/arrow/buffer.h:258
258 ../src/arrow/buffer.h: No such file or directory.
(gdb) bt
#0  0x768e67c8 in arrow::Buffer::is_mutable (this=0x0) at 
../src/arrow/buffer.h:258
#1  0x76c3c41a in 
arrow::io::FixedSizeBufferWriter::FixedSizeBufferWriterImpl::FixedSizeBufferWriterImpl
 (this=0x558921f0, buffer=std::shared_ptr (empty) = {...})
at ../src/arrow/io/memory.cc:164
#2  0x76c3a575 in 
arrow::io::FixedSizeBufferWriter::FixedSizeBufferWriter (this=0x7fffd660, 
buffer=std::shared_ptr (empty) = {...}, __in_chrg=, 
__vtt_parm=) at ../src/arrow/io/memory.cc:227
#3  0x555ebd00 in arrow::ParallelMemoryCopy (state=...) at 
../src/arrow/io/memory_benchmark.cc:303
#4  0x555f80d4 in benchmark::internal::FunctionBenchmark::Run 
(this=0x55891290, st=...)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_register.cc:496
#5  0x5564bcc7 in benchmark::internal::BenchmarkInstance::Run 
(this=0x558939c0, iters=10, thread_id=0, timer=0x7fffd7a0, 
manager=0x55894b70)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_api_internal.cc:10
#6  0x5562c0c8 in benchmark::internal::(anonymous 
namespace)::RunInThread (b=0x558939c0, iters=10, thread_id=0, 
manager=0x55894b70)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:119
#7  0x5562c95a in benchmark::internal::(anonymous 
namespace)::BenchmarkRunner::DoNIterations (this=0x7fffddc0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:214
#8  0x5562d0ac in benchmark::internal::(anonymous 
namespace)::BenchmarkRunner::DoOneRepetition (this=0x7fffddc0, 
repetition_index=0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:299
#9  0x5562c558 in benchmark::internal::(anonymous 
namespace)::BenchmarkRunner::BenchmarkRunner (this=0x7fffddc0, b_=..., 
complexity_reports_=0x7fffdef0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:161
#10 0x5562d47f in benchmark::internal::RunBenchmark (b=..., 
complexity_reports=0x7fffdef0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_runner.cc:355
#11 0x555f0ae6 in benchmark::internal::(anonymous 
namespace)::RunBenchmarks (benchmarks=std::vector of length 9, capacity 12 = 
{...}, display_reporter=0x55891510, file_reporter=0x0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark.cc:265
#12 0x555f13b6 in benchmark::RunSpecifiedBenchmarks 
(display_reporter=0x55891510, file_reporter=0x0)
at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark.cc:399
#13 0x555f0ef8 in benchmark::RunSpecifiedBenchmarks () at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark.cc:340
#14 0x555efc64 in main (argc=1, argv=0x7fffe398) at 
/home/cyb/arrow/cpp/debug/gbenchmark_ep-prefix/src/gbenchmark_ep/src/benchmark_main.cc:17




--
This message was sent by Atlassian Jira
(v8.3.4#803005)