Re: [Java] Arrow PR queue build up?

2019-08-08 Thread Micah Kornfield
I did a pass through most of the open PRs (I might have missed one or
two).  Most had at least a few minor comments, so the backlog hasn't gone
down that much, but I expect most will be mergeable very soon.

On Thu, Aug 8, 2019 at 9:44 AM Micah Kornfield 
wrote:

> Not a full solution, but I've fallen behind a bit. I'm going to plan to
> spend some time tonight at least reviewing PRs I've already done the first
> pass on, and I'll try to pick up some more.
>
> Having more engaged reviewers would be helpful though.
>
> Cheers,
> Micah
>
> On Thursday, August 8, 2019, Wes McKinney  wrote:
>
>> hi folks,
>>
>> Liya Fan and Ji Liu have about 24 open Java PRs between them if I
>> counted right -- it seems like the project is having a hard time
>> keeping up with code reviews and merging on these. It looks to me like
>> they are making a lot of material improvements to the Java library
>> where previously there had not been a lot of development, so I would
>> like to see PRs get merged faster -- any ideas how we might be able to
>> achieve that?  I know that Micah has been spending a lot of time
>> reviewing and giving feedback on these PRs so that is much appreciated.
>>
>> Thanks,
>> Wes
>>
>


[jira] [Created] (ARROW-6183) [R] factor out tidyselect?

2019-08-08 Thread James Lamb (JIRA)
James Lamb created ARROW-6183:
-

 Summary: [R] factor out tidyselect?
 Key: ARROW-6183
 URL: https://issues.apache.org/jira/browse/ARROW-6183
 Project: Apache Arrow
  Issue Type: Wish
Reporter: James Lamb


I noticed tonight that several functions from the *tidyselect* package are 
re-exported by *arrow*. Why is this necessary? In my opinion, the *arrow* R 
package should strive to have as few dependencies as possible and should have 
no opinion about which parts of the R ecosystem ("tidy" or otherwise) are used 
with it.

I think it would be valuable to cut the *tidyselect* re-exports, and to make 
*feather::read_feather()*'s argument *col_select* take a character vector of 
column names instead of a *tidyselect::vars_select()* object. I think that 
would be more natural and would be intuitive for a broader group of R users.

Would you be open to removing *tidyselect* and changing 
*feather::read_feather()* this way?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Options for running the integration tests

2019-08-08 Thread paddy horan
Thanks Krisztián,

I’ll take a look at setting it up.

P


From: Krisztián Szűcs 
Sent: Thursday, August 8, 2019 6:55 PM
To: dev@arrow.apache.org
Subject: Re: Options for running the integration tests

We indeed don't have a docker-compose image for the
"format integration" tests. We can set it up either with
ursabot or with docker-compose; the easiest solution
would be to use the python image as the base image,
install the other language backends (Java, Rust, etc.),
and then simply run the "format integration" suite.

Because ursabot adoption is still under discussion, setting
it up with docker-compose would require a docker
image like:

```dockerfile
FROM python:3.6

RUN install java
RUN install rust
RUN other backends ...

CMD python arrow/integration/integration_test.py
```

and a corresponding entry in the docker-compose.yml.
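A corresponding docker-compose.yml entry might look like the sketch below; the service name and paths are assumptions for illustration, not existing configuration:

```yaml
# Hypothetical docker-compose.yml fragment -- pairs with the
# Dockerfile sketched above; names and paths are placeholders.
services:
  format-integration:
    build:
      context: .
      dockerfile: integration/Dockerfile
    command: python arrow/integration/integration_test.py
```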

On Thu, Aug 8, 2019 at 6:39 PM paddy horan  wrote:

> Thanks Antoine,
>
> > Personally I run C++ / Java integration tests locally, without any
> Docker image. But I wouldn't be able to run the other integration tests...
>
> Right, this is where I started, but I figured it's better to use Docker
> since I'm not too familiar with the other toolchains and the number of
> languages supported is expanding all the time.  I think Krisztian is
> planning to solve this with "ursabot"; I just wanted to make sure I wasn't
> missing anything.  I'll plug away with the "arrow_integration_xenial_base"
> image for now.
>
> Paddy
>
> 
> From: Antoine Pitrou 
> Sent: Thursday, August 8, 2019 10:50 AM
> To: dev@arrow.apache.org 
> Subject: Re: Options for running the integration tests
>
> On Wed, 7 Aug 2019 20:29:13 +
> paddy horan  wrote:
>
> > Hi All,
> >
> > I have been away from Arrow for a while due to relocation of family and
> RSI.  I'd like to start working toward getting Rust passing the integration
> tests.  In the last few months a lot of work has been done to "dockerize"
> many of the build steps in the project, which I'm trying to figure out.
> >
> > I started out using the 'arrow_integration_xenial_base' image and
> submitted a PR to allow it to be built from a windows host, but I noticed
> that there is a page in the pyarrow docs related to integration testing (
> https://arrow.apache.org/docs/developers/integration.html) that uses
> docker-compose from the top level of the project.
>
> That documentation page may be confusing things.  It's entitled
> "integration testing", but it doesn't seem to talk about integration
> tests in the Arrow sense; rather, it covers regular unit tests.
>
> > It seems that the 'arrow_integration_xenial_base' image is replaced
> > by this solution?
>
> I have no idea.  Perhaps Krisztian knows the answer?
> Personally I run C++ / Java integration tests locally, without any
> Docker image. But I wouldn't be able to run the other integration
> tests...
>
> Regards
>
> Antoine.
>
>
>


Re: Options for running the integration tests

2019-08-08 Thread Krisztián Szűcs
We indeed don't have a docker-compose image for the
"format integration" tests. We can set it up either with
ursabot or with docker-compose; the easiest solution
would be to use the python image as the base image,
install the other language backends (Java, Rust, etc.),
and then simply run the "format integration" suite.

Because ursabot adoption is still under discussion, setting
it up with docker-compose would require a docker
image like:

```dockerfile
FROM python:3.6

RUN install java
RUN install rust
RUN other backends ...

CMD python arrow/integration/integration_test.py
```

and a corresponding entry in the docker-compose.yml.

On Thu, Aug 8, 2019 at 6:39 PM paddy horan  wrote:

> Thanks Antoine,
>
> > Personally I run C++ / Java integration tests locally, without any
> Docker image. But I wouldn't be able to run the other integration tests...
>
> Right, this is where I started, but I figured it's better to use Docker
> since I'm not too familiar with the other toolchains and the number of
> languages supported is expanding all the time.  I think Krisztian is
> planning to solve this with "ursabot"; I just wanted to make sure I wasn't
> missing anything.  I'll plug away with the "arrow_integration_xenial_base"
> image for now.
>
> Paddy
>
> 
> From: Antoine Pitrou 
> Sent: Thursday, August 8, 2019 10:50 AM
> To: dev@arrow.apache.org 
> Subject: Re: Options for running the integration tests
>
> On Wed, 7 Aug 2019 20:29:13 +
> paddy horan  wrote:
>
> > Hi All,
> >
> > I have been away from Arrow for a while due to relocation of family and
> RSI.  I'd like to start working toward getting Rust passing the integration
> tests.  In the last few months a lot of work has been done to "dockerize"
> many of the build steps in the project, which I'm trying to figure out.
> >
> > I started out using the 'arrow_integration_xenial_base' image and
> submitted a PR to allow it to be built from a windows host, but I noticed
> that there is a page in the pyarrow docs related to integration testing (
> https://arrow.apache.org/docs/developers/integration.html) that uses
> docker-compose from the top level of the project.
>
> > That documentation page may be confusing things.  It's entitled
> > "integration testing", but it doesn't seem to talk about integration
> > tests in the Arrow sense; rather, it covers regular unit tests.
>
> > It seems that the 'arrow_integration_xenial_base' image is replaced
> > by this solution?
>
> I have no idea.  Perhaps Krisztian knows the answer?
> Personally I run C++ / Java integration tests locally, without any
> Docker image. But I wouldn't be able to run the other integration
> tests...
>
> Regards
>
> Antoine.
>
>
>


Re: Ursabot configuration within Arrow

2019-08-08 Thread Krisztián Szűcs
On Thu, Aug 8, 2019 at 4:24 PM Antoine Pitrou  wrote:

>
> Le 08/08/2019 à 16:12, Krisztián Szűcs a écrit :
> > Hi All!
> >
> > Ursabot now supports debugging failed builds by attaching
> > shells to the still-running builds right after a failing build step:
> >
> > $ ursabot project build --attach-on-failure 'AMD64 Conda C++'
> >
> > And local source/git directories can also be mounted to the builder
> > instead of cloning arrow, which makes debugging a lot easier:
> >
> > $ ursabot project build -s ~/Workspace/arrow:. 'AMD64 Conda C++'
> >
> > Mount destination `.` is relative to the build directory on the workers.
> >
> > The CI configuration for arrow is available here:
> > https://github.com/ursa-labs/ursabot/tree/master/projects/arrow
>
> As I've already said: most build configuration should *not* be in the
> buildmaster configuration.  Otherwise this forces a unique build
> configuration (for all branches, for all PRs) and this also forces to
> restart the buildmaster when changing the build configuration (which is
> not a good idea).
>
> Compare with Travis-CI or other services:
> - the CI configuration and scripts are local to the Arrow repository
>
That is the plan: to move the arrow configuration into the arrow repository,
to be governed by the arrow community.

> - each PR or branch can change the CI configuration without impacting
> other builds
>
We could introduce an automatic mechanism for that, but it has security
concerns. In the worst case, we can run the ursabot builders on any public CI
service, like we actually run the arrow builders on the ursabot repository:
https://travis-ci.org/ursa-labs/ursabot/builds/569364742

> - one can change the CI configuration without having to restart a global
> daemon or service
>
With a self-hosted infrastructure it is not that easy, and it at least
involves security concerns. But we can still develop it, if it is a desired
feature.

>
> Regards
>
> Antoine.
>


[jira] [Created] (ARROW-6182) [R] Package fails to load with error `CXXABI_1.3.11' not found

2019-08-08 Thread Ian Cook (JIRA)
Ian Cook created ARROW-6182:
---

 Summary: [R] Package fails to load with error `CXXABI_1.3.11' not 
found 
 Key: ARROW-6182
 URL: https://issues.apache.org/jira/browse/ARROW-6182
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.14.1
 Environment: Ubuntu 16.04.6
Reporter: Ian Cook


I'm able to successfully install the C++ and Python libraries from conda-forge, 
then successfully install the R package from CRAN if I use {{--no-test-load}}. 
But after installation, the R package fails to load because 
{{dyn.load("arrow.so")}} fails. It throws this error when loading:
{code:java}
unable to load shared object '~/R/arrow/libs/arrow.so':
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found 
(required by ~/.conda/envs/python3.6/lib/libarrow.so.14)
{code}
Do the Arrow C++ libraries actually require GCC 7.1.0 / CXXABI_1.3.11? If not, 
what might explain this error message? Thanks.
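Not an answer, but a quick way to see which CXXABI versions a given libstdc++ actually provides is to scan the library for its embedded version strings (equivalent to `strings libstdc++.so.6 | grep CXXABI`). A small sketch; the library path in the usage comment comes from the error message above and may differ per system:

```python
import re

def cxxabi_versions(library_bytes):
    # Extract the embedded CXXABI_* version strings from a shared-library
    # image and sort them numerically (newest last).
    found = {m.decode() for m in re.findall(rb"CXXABI_[0-9][0-9.]*[0-9]", library_bytes)}
    return sorted(found, key=lambda v: [int(x) for x in v.split("_")[1].split(".")])

# Usage (path is system-specific; taken from the error message):
# data = open("/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "rb").read()
# print(cxxabi_versions(data)[-1])  # if below CXXABI_1.3.11, the GCC 7.1-era symbols are missing
```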



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6181) [R] Only allow R package to install without libarrow on linux

2019-08-08 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6181:
--

 Summary: [R] Only allow R package to install without libarrow on 
linux
 Key: ARROW-6181
 URL: https://issues.apache.org/jira/browse/ARROW-6181
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson


See https://issues.apache.org/jira/browse/ARROW-6167 for backstory. Now that 
we're on CRAN, we can be less paranoid about build failures getting the package 
rejected, and we can focus on solidifying the CRAN binary package experience. 
The macOS binaries for 0.14.1 were built without the C++ library, which we did 
not expect and cannot reproduce. At this point, it would probably be better to 
have a failed build than have binaries get made but be useless. Plus, word has 
it that for macOS binary builds, CRAN will retry if they fail for some reason. 
It's possible that whatever failed for 0.14.1 was transient, and if the build 
had failed instead of carried on without libarrow, on retry it may have built 
successfully.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Proposal to move website source to arrow-site, add automatic builds

2019-08-08 Thread Neal Richardson
I need a committer to make a master branch on arrow-site so that I can
PR to it. I thought it could be just an empty orphan branch but that
proved not to work, so a committer will need to do the following:

```
git clone git@github.com:$YOURGITHUB/arrow.git arrow-copy
cd arrow-copy
git filter-branch --prune-empty --subdirectory-filter site master
vi .git/config
# Change remote "origin"'s URL to be git@github.com:apache/arrow-site.git
git push -f origin master
```

On Thu, Aug 8, 2019 at 12:07 PM Wes McKinney  wrote:
>
> Yes, I think we have adequate lazy consensus. Can you spell out what
> are the next steps?
>
> On Thu, Aug 8, 2019 at 2:01 PM Neal Richardson
>  wrote:
> >
> > Have we reached "lazy consensus" here? No further comments in the last
> > three days.
> >
> > Thanks,
> > Neal
> >
> > On Mon, Aug 5, 2019 at 1:46 PM Joris Van den Bossche
> >  wrote:
> > >
> > > This sounds like a good proposal to me (at least at the moment where we have
> > > separate docs and main site).
> > > I agree that documentation should indeed stay with the code, as you want 
> > > to
> > > update those together in PRs. But the website is something you can
> > > typically update separately and also might want to update independently
> > > from code releases. And certainly if this proposal makes it easier to work
> > > on the site, all the better.
> > >
> > > Joris
> > >
> > > Op ma 5 aug. 2019 20:30 schreef Wes McKinney :
> > >
> > > > Let's wait a little while to collect any additional opinions about this.
> > > >
> > > > There's pretty good evidence from other Apache projects that this
> > > > isn't too bad of an idea
> > > >
> > > > Apache Calcite: https://github.com/apache/calcite-site
> > > > Apache Kafka: https://github.com/apache/kafka-site
> > > > Apache Spark: https://github.com/apache/spark-website
> > > >
> > > > The Apache projects I've seen where the same repository is used for
> > > > $FOO.apache.org tend to be ones where the documentation _is_ the
> > > > website. I think we would need to commission a significant web design
> > > > overhaul to be able to make our documentation page adequate as the
> > > > landing point for visitors to https://arrow.apache.org.
> > > >
> > > > On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson
> > > >  wrote:
> > > > >
> > > > > Given the status quo, it would be difficult for this to make the Arrow
> > > > > website less maintained. In fact, arrow-site is currently missing the
> > > > > most recent two patches that modified the site directory in
> > > > > apache/arrow. Having multiple manual deploy steps increases the
> > > > > likelihood that the website stays stale.
> > > > >
> > > > > As someone who has been working on the arrow site lately, this
> > > > > proposal makes it easier for me to make changes to the website because
> > > > > I can automatically deploy my changes to a test site, and that lets
> > > > > others in the community, who perhaps don't touch the website much,
> > > > > verify that they're good.
> > > > >
> > > > > I agree that the documentation situation needs attention, but as I
> > > > > said initially, that's orthogonal to this static site generation. I'd
> > > > > like to work on that next, and I think these changes will make it
> > > > > easier to do. I would not propose moving doc generation out of
> > > > > apache/arrow--that belongs with the code.
> > > > >
> > > > > Neal
> > > > >
> > > > > On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney  
> > > > > wrote:
> > > > > >
> > > > > > I think that the project website and the project documentation are
> > > > > > currently distinct entities. The current Jekyll website is 
> > > > > > independent
> > > > > > from the Sphinx documentation project aside from a link to the
> > > > > > documentation from the website.
> > > > > >
> > > > > > I am guessing that we would want to maintain some amount of 
> > > > > > separation
> > > > > > between the main site at arrow.apache.org and the code / format
> > > > > > documentation, at minimum because we may want to make documentation
> > > > > > available for multiple versions of the project (this has already 
> > > > > > been
> > > > > > cited as an issue -- when we release, we're overwriting the previous
> > > > > > version of the docs)
> > > > > >
> > > > > > On Sat, Aug 3, 2019 at 11:33 AM Antoine Pitrou 
> > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > I am concerned with this.  What happens if we happen to move part 
> > > > > > > of
> > > > the
> > > > > > > current site to e.g. the Sphinx docs in the Arrow repository (we
> > > > already
> > > > > > > did that, so it's not theoretical)?
> > > > > > >
> > > > > > > More generally, I also think that any move towards separating 
> > > > > > > website
> > > > > > > and code repo more will lead to an even less maintained website.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > Le 02/08/2019 à 22:39, Wes McKinney a écrit :
> > > > > > > > hi Neal,

[jira] [Created] (ARROW-6180) [C++] Create InputStream that references a read-only segment of a RandomAccessFile

2019-08-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6180:
---

 Summary: [C++] Create InputStream that references a read-only 
segment of a RandomAccessFile
 Key: ARROW-6180
 URL: https://issues.apache.org/jira/browse/ARROW-6180
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


If different threads want to do buffered reads over different portions of a 
file (and they are unable to create their own separate file handles), they may 
clobber each other. I would propose creating an object that holds the 
RandomAccessFile internally and implements the InputStream API in a way that is 
safe from other threads changing the file position.
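The idea can be sketched in plain Python (not the actual C++ API; all names here are hypothetical): each segment stream tracks its own logical position and reads with `os.pread`, which never moves the shared file descriptor's offset, so concurrent streams cannot clobber each other.

```python
import os
import tempfile

class SegmentInputStream:
    # Hypothetical sketch: an InputStream over [offset, offset + length)
    # of a shared file. Positional reads leave the shared fd untouched.
    def __init__(self, fd, offset, length):
        self.fd = fd
        self.pos = offset
        self.end = offset + length

    def read(self, nbytes):
        nbytes = min(nbytes, self.end - self.pos)
        data = os.pread(self.fd, nbytes, self.pos)  # does not move the fd offset
        self.pos += len(data)
        return data

# Two streams over the same handle do not interfere:
with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    s1 = SegmentInputStream(f.fileno(), 2, 4)
    s2 = SegmentInputStream(f.fileno(), 6, 4)
    assert s1.read(2) == b"23"
    assert s2.read(4) == b"6789"
    assert s1.read(10) == b"45"  # clamped to the segment end
```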



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: Proposal to move website source to arrow-site, add automatic builds

2019-08-08 Thread Wes McKinney
Yes, I think we have adequate lazy consensus. Can you spell out what
are the next steps?

On Thu, Aug 8, 2019 at 2:01 PM Neal Richardson
 wrote:
>
> Have we reached "lazy consensus" here? No further comments in the last
> three days.
>
> Thanks,
> Neal
>
> On Mon, Aug 5, 2019 at 1:46 PM Joris Van den Bossche
>  wrote:
> >
> > This sounds like a good proposal to me (at least at the moment where we have
> > separate docs and main site).
> > I agree that documentation should indeed stay with the code, as you want to
> > update those together in PRs. But the website is something you can
> > typically update separately and also might want to update independently
> > from code releases. And certainly if this proposal makes it easier to work
> > on the site, all the better.
> >
> > Joris
> >
> > Op ma 5 aug. 2019 20:30 schreef Wes McKinney :
> >
> > > Let's wait a little while to collect any additional opinions about this.
> > >
> > > There's pretty good evidence from other Apache projects that this
> > > isn't too bad of an idea
> > >
> > > Apache Calcite: https://github.com/apache/calcite-site
> > > Apache Kafka: https://github.com/apache/kafka-site
> > > Apache Spark: https://github.com/apache/spark-website
> > >
> > > The Apache projects I've seen where the same repository is used for
> > > $FOO.apache.org tend to be ones where the documentation _is_ the
> > > website. I think we would need to commission a significant web design
> > > overhaul to be able to make our documentation page adequate as the
> > > landing point for visitors to https://arrow.apache.org.
> > >
> > > On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson
> > >  wrote:
> > > >
> > > > Given the status quo, it would be difficult for this to make the Arrow
> > > > website less maintained. In fact, arrow-site is currently missing the
> > > > most recent two patches that modified the site directory in
> > > > apache/arrow. Having multiple manual deploy steps increases the
> > > > likelihood that the website stays stale.
> > > >
> > > > As someone who has been working on the arrow site lately, this
> > > > proposal makes it easier for me to make changes to the website because
> > > > I can automatically deploy my changes to a test site, and that lets
> > > > others in the community, who perhaps don't touch the website much,
> > > > verify that they're good.
> > > >
> > > > I agree that the documentation situation needs attention, but as I
> > > > said initially, that's orthogonal to this static site generation. I'd
> > > > like to work on that next, and I think these changes will make it
> > > > easier to do. I would not propose moving doc generation out of
> > > > apache/arrow--that belongs with the code.
> > > >
> > > > Neal
> > > >
> > > > On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney  wrote:
> > > > >
> > > > > I think that the project website and the project documentation are
> > > > > currently distinct entities. The current Jekyll website is independent
> > > > > from the Sphinx documentation project aside from a link to the
> > > > > documentation from the website.
> > > > >
> > > > > I am guessing that we would want to maintain some amount of separation
> > > > > between the main site at arrow.apache.org and the code / format
> > > > > documentation, at minimum because we may want to make documentation
> > > > > available for multiple versions of the project (this has already been
> > > > > cited as an issue -- when we release, we're overwriting the previous
> > > > > version of the docs)
> > > > >
> > > > > On Sat, Aug 3, 2019 at 11:33 AM Antoine Pitrou 
> > > wrote:
> > > > > >
> > > > > >
> > > > > > I am concerned with this.  What happens if we happen to move part of
> > > the
> > > > > > current site to e.g. the Sphinx docs in the Arrow repository (we
> > > already
> > > > > > did that, so it's not theoretical)?
> > > > > >
> > > > > > More generally, I also think that any move towards separating 
> > > > > > website
> > > > > > and code repo more will lead to an even less maintained website.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > > Le 02/08/2019 à 22:39, Wes McKinney a écrit :
> > > > > > > hi Neal,
> > > > > > >
> > > > > > > In general the improvements to the site sound good, and I agree
> > > with
> > > > > > > moving the site into the apache/arrow-site repository.
> > > > > > >
> > > > > > > It sounds like a committer will have to volunteer a PAT for the
> > > Travis
> > > > > > > CI settings in
> > > > > > >
> > > > > > > https://travis-ci.org/apache/arrow-site/settings
> > > > > > >
> > > > > > > Even though you can't get at such an environment variable there
> > > after
> > > > > > > it's set, it could still technically be compromised. Personally I
> > > > > > > wouldn't be comfortable having a token with "repo" scope out
> > > there. We
> > > > > > > might need to think about this some more -- the general idea of
> > > making
> > > > > > > it easier to deploy the website 

Re: Proposal to move website source to arrow-site, add automatic builds

2019-08-08 Thread Neal Richardson
Have we reached "lazy consensus" here? No further comments in the last
three days.

Thanks,
Neal

On Mon, Aug 5, 2019 at 1:46 PM Joris Van den Bossche
 wrote:
>
> This sounds like a good proposal to me (at least at the moment where we have
> separate docs and main site).
> I agree that documentation should indeed stay with the code, as you want to
> update those together in PRs. But the website is something you can
> typically update separately and also might want to update independently
> from code releases. And certainly if this proposal makes it easier to work
> on the site, all the better.
>
> Joris
>
> Op ma 5 aug. 2019 20:30 schreef Wes McKinney :
>
> > Let's wait a little while to collect any additional opinions about this.
> >
> > There's pretty good evidence from other Apache projects that this
> > isn't too bad of an idea
> >
> > Apache Calcite: https://github.com/apache/calcite-site
> > Apache Kafka: https://github.com/apache/kafka-site
> > Apache Spark: https://github.com/apache/spark-website
> >
> > The Apache projects I've seen where the same repository is used for
> > $FOO.apache.org tend to be ones where the documentation _is_ the
> > website. I think we would need to commission a significant web design
> > overhaul to be able to make our documentation page adequate as the
> > landing point for visitors to https://arrow.apache.org.
> >
> > On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson
> >  wrote:
> > >
> > > Given the status quo, it would be difficult for this to make the Arrow
> > > website less maintained. In fact, arrow-site is currently missing the
> > > most recent two patches that modified the site directory in
> > > apache/arrow. Having multiple manual deploy steps increases the
> > > likelihood that the website stays stale.
> > >
> > > As someone who has been working on the arrow site lately, this
> > > proposal makes it easier for me to make changes to the website because
> > > I can automatically deploy my changes to a test site, and that lets
> > > others in the community, who perhaps don't touch the website much,
> > > verify that they're good.
> > >
> > > I agree that the documentation situation needs attention, but as I
> > > said initially, that's orthogonal to this static site generation. I'd
> > > like to work on that next, and I think these changes will make it
> > > easier to do. I would not propose moving doc generation out of
> > > apache/arrow--that belongs with the code.
> > >
> > > Neal
> > >
> > > On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney  wrote:
> > > >
> > > > I think that the project website and the project documentation are
> > > > currently distinct entities. The current Jekyll website is independent
> > > > from the Sphinx documentation project aside from a link to the
> > > > documentation from the website.
> > > >
> > > > I am guessing that we would want to maintain some amount of separation
> > > > between the main site at arrow.apache.org and the code / format
> > > > documentation, at minimum because we may want to make documentation
> > > > available for multiple versions of the project (this has already been
> > > > cited as an issue -- when we release, we're overwriting the previous
> > > > version of the docs)
> > > >
> > > > On Sat, Aug 3, 2019 at 11:33 AM Antoine Pitrou 
> > wrote:
> > > > >
> > > > >
> > > > > I am concerned with this.  What happens if we happen to move part of
> > the
> > > > > current site to e.g. the Sphinx docs in the Arrow repository (we
> > already
> > > > > did that, so it's not theoretical)?
> > > > >
> > > > > More generally, I also think that any move towards separating website
> > > > > and code repo more will lead to an even less maintained website.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > Le 02/08/2019 à 22:39, Wes McKinney a écrit :
> > > > > > hi Neal,
> > > > > >
> > > > > > In general the improvements to the site sound good, and I agree
> > with
> > > > > > moving the site into the apache/arrow-site repository.
> > > > > >
> > > > > > It sounds like a committer will have to volunteer a PAT for the
> > Travis
> > > > > > CI settings in
> > > > > >
> > > > > > https://travis-ci.org/apache/arrow-site/settings
> > > > > >
> > > > > > Even though you can't get at such an environment variable there
> > after
> > > > > > it's set, it could still technically be compromised. Personally I
> > > > > > wouldn't be comfortable having a token with "repo" scope out
> > there. We
> > > > > > might need to think about this some more -- the general idea of
> > making
> > > > > > it easier to deploy the website I'm totally on board with
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 2, 2019 at 1:35 PM Neal Richardson
> > > > > >  wrote:
> > > > > >>
> > > > > >> Hi all,
> > > > > >> https://issues.apache.org/jira/browse/ARROW-5746 requested to
> > move the
> > > > > >> source for https://arrow.apache.org out of `apache/arrow` due to
> > the
> > > > > >> 

Re: [Java] Arrow PR queue build up?

2019-08-08 Thread Micah Kornfield
Not a full solution, but I've fallen behind a bit. I'm going to plan to
spend some time tonight at least reviewing PRs I've already done the first
pass on, and I'll try to pick up some more.

Having more engaged reviewers would be helpful though.

Cheers,
Micah

On Thursday, August 8, 2019, Wes McKinney  wrote:

> hi folks,
>
> Liya Fan and Ji Liu have about 24 open Java PRs between them if I
> counted right -- it seems like the project is having a hard time
> keeping up with code reviews and merging on these. It looks to me like
> they are making a lot of material improvements to the Java library
> where previously there had not been a lot of development, so I would
> like to see PRs get merged faster -- any ideas how we might be able to
> achieve that?  I know that Micah has been spending a lot of time
> reviewing and giving feedback on these PRs so that is much appreciated.
>
> Thanks,
> Wes
>


[jira] [Created] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6179:


 Summary: [C++] ExtensionType subclass for "unknown" types?
 Key: ARROW-6179
 URL: https://issues.apache.org/jira/browse/ARROW-6179
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche


In C++, when receiving IPC with extension type metadata for a type that is 
unknown (the name is not registered), we currently fall back to returning the 
"raw" storage array. The custom metadata (extension name and metadata) is still 
available in the Field metadata.

Alternatively, we could also have a generic {{ExtensionType}} class that can 
hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or 
{{GenericExtensionType}}), keeping the extension name and metadata in the 
Array's type.

This could be a single class where several instances can be created given a 
storage type, extension name and optionally extension metadata. It would be a 
way to have an unregistered extension type.
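As a plain-Python sketch of the proposal (not the Arrow C++ API; the class and function names are hypothetical), the IPC-read fallback could construct one generic type instance per unknown name instead of stripping down to storage. The "ARROW:extension:*" keys are the standard extension metadata keys from the columnar format:

```python
class GenericExtensionType:
    # One class, many instances: each instance represents an unregistered
    # extension type, carrying storage type, name, and raw metadata bytes.
    def __init__(self, storage_type, extension_name, extension_metadata=b""):
        self.storage_type = storage_type
        self.extension_name = extension_name
        self.extension_metadata = extension_metadata

    def __eq__(self, other):
        return (isinstance(other, GenericExtensionType)
                and (self.storage_type, self.extension_name, self.extension_metadata)
                == (other.storage_type, other.extension_name, other.extension_metadata))

def type_from_ipc(storage_type, field_metadata, registry):
    # On IPC read: look up the registered extension type; if the name is
    # unknown, fall back to GenericExtensionType instead of raw storage.
    name = field_metadata["ARROW:extension:name"]
    meta = field_metadata.get("ARROW:extension:metadata", b"")
    factory = registry.get(name)
    if factory is not None:
        return factory(storage_type, meta)
    return GenericExtensionType(storage_type, name, meta)
```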



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6178) [Developer] Don't fail in merge script on bad primary author input in multi-author PRs

2019-08-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6178:
---

 Summary: [Developer] Don't fail in merge script on bad primary 
author input in multi-author PRs
 Key: ARROW-6178
 URL: https://issues.apache.org/jira/browse/ARROW-6178
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Wes McKinney


I was going on autopilot in a multi-author PR and this happened

{code}
Switched to branch 'PR_TOOL_MERGE_PR_5000_MASTER'
Automatic merge went well; stopped before committing as requested
Author 1: François Saint-Jacques 
Author 2: Wes McKinney 
Enter primary author in the format of "name " [François Saint-Jacques 
]: y
fatal: --author '"y"' is not 'Name ' and matches no existing author
Command failed: ['git', 'commit', '--no-verify', '--author="y"', '-m', 
'ARROW-6121: [Tools] Improve merge tool ergonomics', '-m', '- merge_arrow_pr.py 
now accepts the pull-request number as a single optional argument, e.g. 
`./merge_arrow_pr.py 4921`.\r\n- merge_arrow_pr.py can optionally read a 
configuration file located in   `~/.config/arrow/merge.conf` which contains 
options like jira credentials. See the `dev/merge.conf` file as example', '-m', 
'Closes #5000 from fsaintjacques/ARROW-6121-merge-ergonomic and squashes the 
following commits:', '-m', '5298308d7  Handle username/password 
separately (in case username is set but not password)\n581653735  Rename merge.conf to merge.conf.sample\n7c51ca8f0  Add license to config file\n1213946bd  
ARROW-6121:  Improve merge tool ergonomics', '-m', 'Lead-authored-by: 
y\nCo-authored-by: François Saint-Jacques 
\nCo-authored-by: Wes McKinney 
\nSigned-off-by: Wes McKinney ']
With output:
--
b''
--
Traceback (most recent call last):
  File "dev/merge_arrow_pr.py", line 530, in 
if pr.is_merged:
  File "dev/merge_arrow_pr.py", line 515, in cli
PROJECT_NAME = os.environ.get('ARROW_PROJECT_NAME') or 'arrow'
  File "dev/merge_arrow_pr.py", line 420, in merge
'--author="%s"' % primary_author] +
  File "dev/merge_arrow_pr.py", line 89, in run_cmd
print('--')
  File "dev/merge_arrow_pr.py", line 81, in run_cmd
try:
  File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/subprocess.py", line 
395, in check_output
**kwargs).stdout
  File "/home/wesm/miniconda/envs/arrow-3.7/lib/python3.7/subprocess.py", line 
487, in run
output=stdout, stderr=stderr)
{code}

If the input does not match the expected format, we should loop to request 
input again rather than failing out (which requires messy manual cleanup of 
temporary branches)
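A minimal sketch of that re-prompting loop (hypothetical helper — the function name, prompt text, and validation regex are assumptions for illustration, not the actual merge_arrow_pr.py code):

```python
import re

# Hypothetical sketch of a re-prompting loop; the regex and prompt text are
# assumptions, not the actual merge_arrow_pr.py implementation.
AUTHOR_RE = re.compile(r'^.+ <[^<>@]+@[^<>@]+>$')

def prompt_primary_author(default, input_fn=input):
    """Keep asking until the input looks like 'Name <email>' (or is empty)."""
    while True:
        value = input_fn('Enter primary author [%s]: ' % default).strip()
        if not value:
            return default  # accept the default on empty input
        if AUTHOR_RE.match(value):
            return value
        print('"%s" is not in the form "Name <email>"; please try again' % value)

# Simulated session: first a bad entry ("y"), then a valid one.
answers = iter(['y', 'Jane Doe <jane@example.com>'])
print(prompt_primary_author('A B <a@b.c>', input_fn=lambda prompt: next(answers)))
```

With this shape, a stray "y" only costs the user one extra prompt instead of an aborted merge and leftover temporary branches.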





Re: Options for running the integration tests

2019-08-08 Thread paddy horan
Thanks Antoine,

> Personally I run C++ / Java integration tests locally, without any Docker 
> image. But I wouldn't be able to run the other integration tests...

Right, this is where I started, but I figured it's better to use Docker as I'm not 
too familiar with the other toolchains and the number of languages supported is 
expanding all the time.  I'm thinking Krisztian is planning to solve this with 
"ursabot"; I just wanted to make sure I wasn't missing anything.  I'll plug 
away with the "arrow_integration_xenial_base" image for now.

Paddy


From: Antoine Pitrou 
Sent: Thursday, August 8, 2019 10:50 AM
To: dev@arrow.apache.org 
Subject: Re: Options for running the integration tests

On Wed, 7 Aug 2019 20:29:13 +
paddy horan  wrote:

> Hi All,
>
> I have been away from Arrow for a while due to relocation of family and RSI.  
> I'd like to start working toward getting Rust passing the integration tests.  
> In the last few months a lot of work has been done to "dockerize" many of the 
> build steps in the project, which I'm trying to figure out.
>
> I started out using the 'arrow_integration_xenial_base' image and submitted a 
> PR to allow it to be built from a windows host, but I noticed that there is a 
> page in the pyarrow docs related to integration testing 
> (https://arrow.apache.org/docs/developers/integration.html) that uses 
> docker-compose from the top level of the project.

That documentation page may be confusing things.  It's entitled
"integration testing" but it doesn't seem to talk about integration
tests in the Arrow sense, rather regular unit tests.

> It seems that the 'arrow_integration_xenial_base' image is replaced
> by this solution?

I have no idea.  Perhaps Krisztian knows the answer?
Personally I run C++ / Java integration tests locally, without any
Docker image. But I wouldn't be able to run the other integration
tests...

Regards

Antoine.




[Java] Arrow PR queue build up?

2019-08-08 Thread Wes McKinney
hi folks,

Liya Fan and Ji Liu have about 24 open Java PRs between them if I
counted right -- it seems like the project is having a hard time
keeping up with code reviews and merging on these. It looks to me like
they are making a lot of material improvements to the Java library
where previously there had not been a lot of development, so I would
like to see PRs get merged faster -- any ideas how we might be able to
achieve that?  I know that Micah has been spending a lot of time
reviewing and giving feedback on these PRs so that is much appreciated

Thanks,
Wes


[jira] [Created] (ARROW-6177) [C++] Add Arrow::Validate()

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6177:
-

 Summary: [C++] Add Arrow::Validate()
 Key: ARROW-6177
 URL: https://issues.apache.org/jira/browse/ARROW-6177
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Affects Versions: 0.14.1
Reporter: Antoine Pitrou


It's a bit weird to have {{ChunkedArray::Validate()}} and {{Table::Validate()}} 
methods but only a standalone {{ValidateArray}} function for arrays.





[jira] [Created] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type

2019-08-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6176:


 Summary: [Python] Allow to subclass ExtensionArray to attach to 
custom extension type
 Key: ARROW-6176
 URL: https://issues.apache.org/jira/browse/ARROW-6176
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, you can define a custom extension type in Python with 

{code}
class UuidType(pa.ExtensionType):

def __init__(self):
pa.ExtensionType.__init__(self, pa.binary(16))

def __reduce__(self):
return UuidType, ()
{code}

but the array you can create with this is always a plain ExtensionArray. We should 
provide a way to define a subclass (e.g. `UuidArray` in this case) that can hold 
custom logic.

For example, a user might want to define `UuidArray` such that `arr[i]` returns 
an instance of Python's `uuid.UUID`
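For illustration, the value conversion such a subclass would perform already exists in the standard library; a sketch (the `UuidArray` class shape is hypothetical — only `uuid.UUID` is real):

```python
import uuid

# Hypothetical shape of the proposed subclass (pyarrow wiring deliberately
# omitted; this is not an existing pyarrow API):
#
#     class UuidArray(pa.ExtensionArray):
#         def __getitem__(self, i):
#             return uuid.UUID(bytes=self.storage[i].as_py())
#
# The conversion from the 16 storage bytes to a rich value is pure stdlib:
raw = bytes(range(16))
print(uuid.UUID(bytes=raw))  # 00010203-0405-0607-0809-0a0b0c0d0e0f
```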

From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691





Re: Options for running the integration tests

2019-08-08 Thread Antoine Pitrou
On Wed, 7 Aug 2019 20:29:13 +
paddy horan  wrote:

> Hi All,
> 
> I have been away from Arrow for a while due to relocation of family and RSI.  
> I'd like to start working toward getting Rust passing the integration tests.  
> In the last few months a lot of work has been done to "dockerize" many of the 
> build steps in the project, which I'm trying to figure out.
> 
> I started out using the 'arrow_integration_xenial_base' image and submitted a 
> PR to allow it to be built from a windows host, but I noticed that there is a 
> page in the pyarrow docs related to integration testing 
> (https://arrow.apache.org/docs/developers/integration.html) that uses 
> docker-compose from the top level of the project.

That documentation page may be confusing things.  It's entitled
"integration testing" but it doesn't seem to talk about integration
tests in the Arrow sense, rather regular unit tests.

> It seems that the 'arrow_integration_xenial_base' image is replaced
> by this solution?

I have no idea.  Perhaps Krisztian knows the answer?
Personally I run C++ / Java integration tests locally, without any
Docker image. But I wouldn't be able to run the other integration
tests...

Regards

Antoine.




[jira] [Created] (ARROW-6175) [Java] Fix MapVector#getMinorType and extend AbstractContainerVector addOrGet complex vector API

2019-08-08 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6175:
-

 Summary: [Java] Fix MapVector#getMinorType and extend 
AbstractContainerVector addOrGet complex vector API
 Key: ARROW-6175
 URL: https://issues.apache.org/jira/browse/ARROW-6175
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


i. Currently {{MapVector}} extends {{ListVector}}, so {{MapVector#getMinorType}} 
returns the wrong {{MinorType}}.

ii. {{AbstractContainerVector}} now only has {{addOrGetList}}, 
{{addOrGetUnion}}, {{addOrGetStruct}}, which do not cover all complex types, such 
as {{MapVector}} and {{FixedSizeListVector}}.





Re: Ursabot configuration within Arrow

2019-08-08 Thread Antoine Pitrou


On 08/08/2019 at 16:12, Krisztián Szűcs wrote:
> Hi All!
> 
> Ursabot now supports debugging failed builds by attaching a shell to the
> still-running build right after a failing build step:
> 
> $ ursabot project build --attach-on-failure `AMD64 Conda C++`
> 
> And local source/git directories can also be mounted to the builder
> instead of cloning arrow, this makes the debugging a lot easier:
> 
> $ ursabot project build -s ~/Workspace/arrow:. 'AMD64 Conda C++'
> 
> Mount destination `.` is relative to the build directory on the workers.
> 
> The CI configuration for arrow is available here:
> https://github.com/ursa-labs/ursabot/tree/master/projects/arrow

As I've already said: most build configuration should *not* be in the
buildmaster configuration.  Otherwise this forces a unique build
configuration (for all branches, for all PRs) and this also forces to
restart the buildmaster when changing the build configuration (which is
not a good idea).

Compare with Travis-CI or other services:
- the CI configuration and scripts are local to the Arrow repository
- each PR or branch can change the CI configuration without impacting
other builds
- one can change the CI configuration without having to restart a global
daemon or service

Regards

Antoine.


Re: Ursabot configuration within Arrow

2019-08-08 Thread Krisztián Szűcs
Hi All!

Ursabot now supports debugging failed builds by attaching a shell to the
still-running build right after a failing build step:

$ ursabot project build --attach-on-failure `AMD64 Conda C++`

And local source/git directories can also be mounted to the builder
instead of cloning arrow, this makes the debugging a lot easier:

$ ursabot project build -s ~/Workspace/arrow:. 'AMD64 Conda C++'

Mount destination `.` is relative to the build directory on the workers.

The CI configuration for arrow is available here:
https://github.com/ursa-labs/ursabot/tree/master/projects/arrow

I'd like to proceed to the code donation, but not sure what steps are
required.
I'd also like to receive feedback from other members, because this change
would heavily affect the future of arrow's continuous integration.

Regards, Krisztian

On Wed, Jul 31, 2019 at 8:59 PM Krisztián Szűcs 
wrote:

> We can now reproduce the builds locally (without the need of
> the web UI) with a single command:
>
> To demonstrate, building the master branch and building a pull
> request require the following commands:
>
> $ ursabot project build 'AMD64 Ubuntu 18.04 C++'
>
> $ ursabot project build -pr  'AMD64 Ubuntu 18.04 C++'
>
> See the output here:
> https://travis-ci.org/ursa-labs/ursabot/builds/566057077#L988
>
> This effectively means that the builders defined in ursabot
> can be run directly on machines or CI services which have
> Docker installed (with a single command).
> It also replaces the need for the docker-compose setup.
>
> I'm going to write some documentation and prepare the arrow
> builders for a donation to the arrow codebase (which of course
> requires a vote).
>
> If anyone has a question please don't hesitate to ask!
>
> Regards, Krisztian
>
>
> On Tue, Jul 30, 2019 at 4:45 PM Krisztián Szűcs 
> wrote:
>
>> Ok, but the configuration movement to arrow is orthogonal to
>> the local reproducibility feature. Could we proceed with that?
>>
>> On Tue, Jul 30, 2019 at 4:38 PM Wes McKinney  wrote:
>>
>>> I will defer to others to investigate this matter further but I would
>>> really like to see a concrete and practical path to local
>>> reproducibility before moving forward on any changes to our current
>>> CI.
>>>
>>> On Tue, Jul 30, 2019 at 7:38 AM Krisztián Szűcs
>>>  wrote:
>>> >
>>> > Fixed it and restarted a bunch of builds.
>>> >
>>> > On Tue, Jul 30, 2019 at 5:13 AM Wes McKinney 
>>> wrote:
>>> >
>>> > > By the way, can you please disable the Buildbot builders that are
>>> > > causing builds on master to fail? We haven't had a passing build in
>>> > > over a week. Until we reconcile the build configurations we shouldn't
>>> > > be failing contributors' builds
>>> > >
>>> > > On Mon, Jul 29, 2019 at 8:23 PM Wes McKinney 
>>> wrote:
>>> > > >
>>> > > > On Mon, Jul 29, 2019 at 7:58 PM Krisztián Szűcs
>>> > > >  wrote:
>>> > > > >
>>> > > > > On Tue, Jul 30, 2019 at 1:38 AM Wes McKinney <
>>> wesmck...@gmail.com>
>>> > > wrote:
>>> > > > >
>>> > > > > > hi Krisztian,
>>> > > > > >
>>> > > > > > Before talking about any code donations or where to run
>>> builds, I
>>> > > > > > think we first need to discuss the worrisome situation where
>>> we have
>>> > > > > > in some cases 3 (or more) CI configurations for different
>>> components
>>> > > > > > in the project.
>>> > > > > >
>>> > > > > > Just taking into account out C++ build, we have:
>>> > > > > >
>>> > > > > > * A config for Travis CI
>>> > > > > > * Multiple configurations in Dockerfiles under cpp/
>>> > > > > > * A brand new (?) configuration in this third party
>>> ursa-labs/ursabot
>>> > > > > > repository
>>> > > > > >
>>> > > > > > I note for example that the "AMD64 Conda C++" Buildbot build is
>>> > > > > > failing while Travis CI is succeeding
>>> > > > > >
>>> > > > > > https://ci.ursalabs.org/#builders/66/builds/3196
>>> > > > > >
>>> > > > > > Starting from first principles, at least for Linux-based
>>> builds, what
>>> > > > > > I would like to see is:
>>> > > > > >
>>> > > > > > * A single build configuration (which can be driven by
>>> yaml-based
>>> > > > > > configuration files and environment variables), rather than 3
>>> like we
>>> > > > > > have now. This build configuration should be decoupled from
>>> any CI
>>> > > > > > platform, including Travis CI and Buildbot
>>> > > > > >
>>> > > > > Yeah, this would be the ideal setup, but I'm afraid the
>>> situation is a
>>> > > bit
>>> > > > > more complicated.
>>> > > > >
>>> > > > > TravisCI
>>> > > > > 
>>> > > > >
>>> > > > > constructed from a bunch of scripts optimized for travis, this
>>> setup is
>>> > > > > slow
>>> > > > > and hardly compatible with any of the remaining setups.
>>> > > > > I think we should ditch it.
>>> > > > >
>>> > > > > The "docker-compose setup"
>>> > > > > --
>>> > > > >
>>> > > > > Most of the Dockerfiles are part of the  docker-compose setup
>>> we've
>>> > > > > developed.
>>> > > > > This might be a good candidate as the tool to 

[jira] [Created] (ARROW-6174) [C++] Parquet tests produce invalid array

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6174:
-

 Summary: [C++] Parquet tests produce invalid array
 Key: ARROW-6174
 URL: https://issues.apache.org/jira/browse/ARROW-6174
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


If I patch {{Table::Validate()}} to also validate the underlying arrays:
{code:c++}
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 446010f93..e617470b5 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include <sstream>
 #include 
 
 #include "arrow/array.h"
@@ -184,10 +185,18 @@ Status ChunkedArray::Validate() const {
   }
 
   const auto& type = *chunks_[0]->type();
+  // Make sure chunks all have the same type, and validate them
   for (size_t i = 1; i < chunks_.size(); ++i) {
-    if (!chunks_[i]->type()->Equals(type)) {
+    const Array& chunk = *chunks_[i];
+    if (!chunk.type()->Equals(type)) {
       return Status::Invalid("In chunk ", i, " expected type ", type.ToString(),
-                             " but saw ", chunks_[i]->type()->ToString());
+                             " but saw ", chunk.type()->ToString());
+    }
+    Status st = ValidateArray(chunk);
+    if (!st.ok()) {
+      std::stringstream ss;
+      ss << "Chunk " << i << ": " << st.message();
+      return st.WithMessage(ss.str());
     }
   }
   return Status::OK();
@@ -343,7 +352,7 @@ class SimpleTable : public Table {
       }
     }
 
-    // Make sure columns are all the same length
+    // Make sure columns are all the same length, and validate them
     for (int i = 0; i < num_columns(); ++i) {
       const ChunkedArray* col = columns_[i].get();
       if (col->length() != num_rows_) {
@@ -351,6 +360,12 @@ class SimpleTable : public Table {
                                " expected length ", num_rows_, " but got length ",
                                col->length());
       }
+      Status st = col->Validate();
+      if (!st.ok()) {
+        std::stringstream ss;
+        ss << "Column " << i << ": " << st.message();
+        return st.WithMessage(ss.str());
+      }
     }
     return Status::OK();
   }
{code}

... then {{parquet-arrow-test}} fails and then crashes:
{code}
[...]
[ RUN  ] TestArrowReadWrite.TableWithChunkedColumns
../src/parquet/arrow/arrow-reader-writer-test.cc:347: Failure
Failed
'WriteTable(*table, ::arrow::default_memory_pool(), sink, row_group_size, 
default_writer_properties(), arrow_properties)' failed with Invalid: Column 0: 
Chunk 1: Final offset invariant not equal to values length: 210!=733
In ../src/arrow/array.cc, line 1229, code: ValidateListArray(array)
In ../src/parquet/arrow/writer.cc, line 1210, code: table.Validate()
In ../src/parquet/arrow/writer.cc, line 1252, code: writer->WriteTable(table, 
chunk_size)
../src/parquet/arrow/arrow-reader-writer-test.cc:419: Failure
Expected: WriteTableToBuffer(table, row_group_size, arrow_properties, ) 
doesn't generate new fatal failures in the current thread.
  Actual: it does.
/home/antoine/arrow/dev/cpp/build-support/run-test.sh : ligne 97 : 28927 Erreur 
de segmentation  $TEST_EXECUTABLE "$@" 2>&1
 28930 Fini| $ROOT/build-support/asan_symbolize.py
 28933 Fini| ${CXXFILT:-c++filt}
 28936 Fini| 
$ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE
 28939 Fini| $pipe_cmd 2>&1
 28941 Fini| tee $LOGFILE
~/arrow/dev/cpp/build-test/src/parquet

{code}





[jira] [Created] (ARROW-6173) [Python] error loading csv submodule

2019-08-08 Thread Igor Yastrebov (JIRA)
Igor Yastrebov created ARROW-6173:
-

 Summary: [Python] error loading csv submodule
 Key: ARROW-6173
 URL: https://issues.apache.org/jira/browse/ARROW-6173
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: Windows 7, conda 4.7.11
Reporter: Igor Yastrebov


When I create a new environment in conda:

{code:java}
conda create -n pyarrow-test python=3.7 pyarrow=0.14.1
{code}

and try to read a csv file:

{code:java}
import pyarrow as pa
pa.csv.read_csv('test.csv')
{code}

it fails with an error:

{code:java}
Traceback (most recent call last):
File "", line 1, in 
AttributeError: module 'pyarrow' has no attribute 'csv'
{code}

However, loading it directly works:

{code:java}
import pyarrow.csv as pc
table = pc.read_csv('test.csv')
{code}

and using pa.csv.read_csv() after loading it directly also works.





[jira] [Created] (ARROW-6172) [Java] Avoid creating value holders repeatedly when reading data from JDBC

2019-08-08 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6172:
---

 Summary: [Java] Avoid creating value holders repeatedly when 
reading data from JDBC
 Key: ARROW-6172
 URL: https://issues.apache.org/jira/browse/ARROW-6172
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When converting JDBC data to Arrow data, a value holder is created for each 
single value. The following code snippet gives an example:

{code:java}
NullableSmallIntHolder holder = new NullableSmallIntHolder();
holder.isSet = isNonNull ? 1 : 0;
if (isNonNull) {
  holder.value = (short) value;
}
smallIntVector.setSafe(rowCount, holder);
smallIntVector.setValueCount(rowCount + 1);
{code}

This is inefficient, both in terms of memory usage and computational efficiency.

For most types, we can improve the performance by directly setting the value.

For example, the benchmarks on IntVector show that a 20% performance 
improvement can be achieved by directly setting the int value:

{code}
Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt    5  15.397 ± 0.018  us/op
IntBenchmarks.setWithValueHolder  avgt    5  19.198 ± 0.789  us/op
{code}

 





[jira] [Created] (ARROW-6171) [R] "docker-compose run r" fails

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6171:
-

 Summary: [R] "docker-compose run r" fails
 Key: ARROW-6171
 URL: https://issues.apache.org/jira/browse/ARROW-6171
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools, R
Reporter: Antoine Pitrou


I get the following failure:
{code}
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = 
DLLpath, ...):
 unable to load shared object 
'/usr/local/lib/R/site-library/00LOCK-arrow/00new/arrow/libs/arrow.so':
  /opt/conda/lib/libarrow.so.100: undefined symbol: 
LZ4F_resetDecompressionContext
Error: loading failed
Execution halted
ERROR: loading failed
* removing '/usr/local/lib/R/site-library/arrow'
{code}






[jira] [Created] (ARROW-6170) [R] "docker-compose build r" is slow

2019-08-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-6170:
-

 Summary: [R] "docker-compose build r" is slow
 Key: ARROW-6170
 URL: https://issues.apache.org/jira/browse/ARROW-6170
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools, R
Reporter: Antoine Pitrou


Apparently it installs and compiles all packages in single-thread mode.





[jira] [Created] (ARROW-6169) A bit confused by arrow::install_arrow() in R

2019-08-08 Thread Michael Chirico (JIRA)
Michael Chirico created ARROW-6169:
--

 Summary: A bit confused by arrow::install_arrow() in R
 Key: ARROW-6169
 URL: https://issues.apache.org/jira/browse/ARROW-6169
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Michael Chirico


I'm trying to get up and running with arrow from R for the first time (macOS 
Mojave 10.14.6).

Started with
{code:java}
install.packages('arrow') #success!
write_parquet(iris, 'tmp.parquet') #oh no!{code}
and hit the error:

> Error in Table__from_dots(dots, schema) : Cannot call Table__from_dots(). 
> Please use arrow::install_arrow() to install required runtime libraries. 

OK, easy enough:
{code:java}
arrow::install_arrow() 
{code}
With output:
{code:java}
You may be able to get a development version of the Arrow C++ library using 
Homebrew: `brew install apache-arrow --HEAD` Or, see the Arrow C++ developer 
guide  for instructions on 
building the library from source.

After you've installed the C++ library, you'll need to reinstall the R package 
from source to find it.

Refer to the R package README 
 for further details.

If you have other trouble, or if you think this message could be improved, 
please report an issue here: 

{code}
A few points of confusion for me as a first time user:

A bit surprised I'm being directed to install the development version? If the 
current CRAN version of {{arrow}} is only compatible with the dev version, I 
guess that could be made more clear in this message. But on the other hand, the 
linked GH README suggests the opposite: "On macOS and Windows, installing a 
binary package from CRAN will handle Arrow’s C++ dependencies for you." 
However, that doesn't appear to have been the case for me.

Oh well, let's just try installing the normal version & see if that works:
{code:java}
$brew install apache-arrow
>install.packages('arrow') #reinstall in fresh session
>arrow::write_parquet(iris, 'tmp.parquet') # same error{code}
Now I try the dev version:
{code:java}
brew install apache-arrow --HEAD
# Error: apache-arrow 0.14.1 is already installed
# To install HEAD, first run `brew unlink apache-arrow`.
brew unlink apache-arrow
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/autoconf not present or broken
# Please reinstall autoconf. Sorry :(
brew install autoconf
brew install apache-arrow --HEAD
# Error: An exception occurred within a child process:
# RuntimeError: /usr/local/opt/cmake not present or broken
# Please reinstall cmake. Sorry :(
brew install cmake
brew install apache-arrow --HEAD
# cmake ../cpp -DCMAKE_C_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_CXX_FLAGS_RELEASE=-DNDEBUG 
-DCMAKE_INSTALL_PREFIX=/usr/local/Cellar/apache-arrow/HEAD-908b058 
-DCMAKE_BUILD_TYPE=Release -DCMAKE_FIND_FRAMEWORK=LAST -DCMAKE_VERBOSE_MAKEFIL
# Last 15 lines from 
/Users/michael.chirico/Library/Logs/Homebrew/apache-arrow/01.cmake:
# 
dlopen(/usr/local/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-darwin.so,
# 2): Symbol not found: ___addtf3
# Referenced from: /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Expected in: /usr/lib/libSystem.B.dylib
# in /usr/local/opt/gcc/lib/gcc/9/libquadmath.0.dylib
# Call Stack (most recent call first):
# src/arrow/python/CMakeLists.txt:23 (find_package){code}
Poked around a bit about that error, and what I see suggests re-installing 
{{scipy}}, but that didn't work (neither {{pip install scipy}} nor {{pip3 install 
scipy}}, though the traceback does suggest it's a Python 3 thing).

So now I'm stuck & not sure how to proceed.

I'll also add that I'm not sure what to make of this:

> After you've installed the C++ library, you'll need to reinstall the R 
> package from source to find it.

What is "find it" referring to exactly? And installing from source here means 
{{R CMD build && R CMD INSTALL}} on the cloned repo?





Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-08 Thread Micah Kornfield
After more investigation, it looks like Float8Benchmarks at least on my
machine are within the range of noise.

For BitVectorHelper I pushed a new commit [1], seems to bring the
BitVectorHelper benchmarks back inline (and even with some improvement for
getNullCountBenchmark).

Benchmark                                        Mode  Cnt   Score   Error  Units
BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ± 0.031  ns/op
BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ± 0.141  ns/op

I applied the same pattern to other loops that I could find: for any
"for (long" loop on the critical path, I broke it up into two loops. The
first loop iterates with an int counter; the second finishes off any
remaining long values. As a side note, it seems like optimization of loops
using long counters has a semi-recent open bug for the JVM [2]
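The two-loop split can be sketched structurally (Python for brevity; the actual benefit is JVM-specific, since C2 unrolls and vectorizes int-indexed loops, so this only illustrates the shape, not the real BitVectorHelper code):

```python
INT_MAX = 2**31 - 1  # Integer.MAX_VALUE

def sum_range(get_at, length):
    """Visit indices [0, length): an int-sized first loop, then a long tail."""
    total = 0
    int_part = min(length, INT_MAX)
    # First loop: iterated with an int counter in the Java version, so the
    # JIT can unroll/vectorize it.
    for i in range(int_part):
        total += get_at(i)
    # Second loop: finishes off any remaining "long" indices past
    # Integer.MAX_VALUE (rarely taken, so its cost doesn't matter).
    i = int_part
    while i < length:
        total += get_at(i)
        i += 1
    return total

print(sum_range(lambda i: 1, 10))  # 10
```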

Thanks,
Micah

[1]
https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
[2] https://bugs.openjdk.java.net/browse/JDK-8223051

On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield 
wrote:

> Indeed, the BoundChecking and CheckNullForGet variables can make a big
> difference.  I didn't initially run the benchmarks with these turned on
> (you can see the result from above with Float8Benchmarks).  Here are new
> numbers including with the flags enabled.  It looks like using longs might
> be a little bit slower, I'll see what I can do to mitigate this.
>
> Ravindra also volunteered to try to benchmark the changes with Dremio's
> code on today's sync call.
>
> New
>
> Benchmark                                        Mode  Cnt   Score   Error  Units
> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   4.176 ± 1.292  ns/op
> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  26.102 ± 0.700  ns/op
> Float8Benchmarks.copyFromBenchmark               avgt    5   7.398 ± 0.084  us/op
> Float8Benchmarks.readWriteBenchmark              avgt    5   2.711 ± 0.057  us/op
>
> Old
>
> Benchmark                                        Mode  Cnt   Score   Error  Units
> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.828 ± 0.030  ns/op
> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  20.611 ± 0.188  ns/op
> Float8Benchmarks.copyFromBenchmark               avgt    5   6.597 ± 0.462  us/op
> Float8Benchmarks.readWriteBenchmark              avgt    5   2.615 ± 0.027  us/op
>
> On Wed, Aug 7, 2019 at 7:13 PM Fan Liya  wrote:
>
>> Hi Gonzalo,
>>
>> Thanks for sharing the performance results.
>> I am wondering if you have turned off the flag
>> BoundsChecking#BOUNDS_CHECKING_ENABLED.
>> If not, the lower throughput should be expected.
>>
>> Best,
>> Liya Fan
>>
>> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield 
>> wrote:
>>
>>> Hi Gonzalo,
>>> Thank you for the feedback.  I wasn't aware of the JIT implications.   At
>>> least on the benchmark run they don't seem to have an impact.
>>>
>>> If there are other benchmarks that people have that can validate if this
>>> change will be problematic I would appreciate trying to run them with the
>>> PR.  I will try to run the ones for zeroing/popcnt tonight to see if
>>> there
>>> is a change in those.
>>>
>>> -Micah
>>>
>>>
>>>
>>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
>>> golthir...@gmail.com> wrote:
>>>
>>> > I would recommend to take care with this kind of changes.
>>> >
>>> > I didn't try Arrow in more than one year, but by then the performance
>>> was
>>> > quite bad in comparison with plain byte buffer access
>>> > (see http://git.net/apache-arrow-development/msg02353.html *) and
>>> > there are several optimizations that the JVM (specifically, C2) does
>>> not
>>> > apply when dealing with int instead of longs. One of the
>>> > most commons is the loop unrolling and vectorization.
>>> >
>>> > * It doesn't seem the best way to reference an old email on the list,
>>> but
>>> > it is the only result shown by Google
>>> >
El mié., 7 ago. 2019 a las 11:42, Fan Liya () wrote
>>> > (translated from Spanish):
>>> >
>>> >> Hi Micah,
>>> >>
>>> >> Thanks for your effort. The performance result looks good.
>>> >>
>>> >> As you indicated, ArrowBuf will take additional 12 bytes (4 bytes for
>>> each
>>> >> of length, write index, and read index).
>>> >> Similar overheads also exist for vectors like BaseFixedWidthVector,
>>> >> BaseVariableWidthVector, etc.
>>> >>
>>> >> IMO, such overheads are small enough to justify the change.
>>> >> Let's check if there are other overheads.
>>> >>
>>> >> Best,
>>> >> Liya Fan
>>> >>
>>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield >> >
>>> >> wrote:
>>> >>
>>> >> > Hi Liya Fan,
>>> >> > Based on the Float8Benchmark there does not seem to be any
>>> meaningful
>>> >> > performance difference on my machine.  At least for me, the
>>> benchmarks
>>> >> are
>>> >> > not stable enough to say one is faster than the other (I've pasted
>>> >> results
>>> >> > below).  That being said my machine isn't necessarily the most
>>> reliable
>>> >> for
>>> >> > benchmarking.
>>> >> >
>>> >> > On an intuitive level, this makes sense to