[jira] [Created] (ARROW-4668) [C++] Support GCP BigQuery Storage API

2019-02-22 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-4668:
--

 Summary: [C++] Support GCP BigQuery Storage API
 Key: ARROW-4668
 URL: https://issues.apache.org/jira/browse/ARROW-4668
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield
 Fix For: 0.14.0


[https://cloud.google.com/bigquery/docs/reference/storage/]

 

Need to investigate the best way to do this. Maybe just see if we can build our 
client on GCP (once a protobuf definition is published to 
https://github.com/googleapis/googleapis/tree/master/google)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Arrow and R benchmark

2019-02-22 Thread Micah Kornfield
Just to follow up on this thread, a new high-throughput API [1] for reading
data out of BigQuery was released to public beta today.  The format it
streams is Avro, so it should be higher performance than parsing JSON (and
reads can be parallelized).  Implementing Avro reading was something I was
going to start working on in the next week or so, and I'll probably
continue on to add support to Arrow C++ for the new API (I will be creating
JIRAs soon).  Given my current bandwidth (I contribute to Arrow in my free
time), this will take a while.  So if people are interested in
collaborating (or taking this over) please let me know.

Also, it looks like someone took my advice and filed a feature request [2]
for surfacing Apache Arrow natively.

Thanks,
Micah

[1] https://cloud.google.com/bigquery/docs/reference/storage/
[2] https://issuetracker.google.com/issues/124858094

On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney  wrote:

> Would someone like to make some feature requests to Google or engage
> with them in another way? I have interacted with GCP in the past; I
> think it would be helpful for them to hear from other Arrow users or
> community members since I have been quite public as a carrier of the
> Arrow banner.
>
> On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield 
> wrote:
> >
> > Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
> > reflects my own opinions, not those of my company.
> >
> > Jonathan and Wes,
> >
> > One way of trying to get support for this is filing a feature request at
> > [1] and getting broader customer support for it.  Another possible way of
> > gaining broader exposure within Google is collaborating with other open
> > source projects that it contributes to.  For instance there was a
> > conversation recently about the potential use of Arrow on the Apache Beam
> > mailing list [2].  I will try to post a link to this thread internally, but
> > I can't make any promises and likely can't give any updates on progress.
> >
> > This is also very much my own opinion, but I think in order to expose Arrow
> > in a public API it would be nice to reach a stable major release (i.e.
> > 1.0.0) and ensure Arrow properly supports BigQuery data types [3]
> > (I think it mostly does, but date/time might be an issue).
> >
> > [1] https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> > [2] https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> > [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> >
> >
> > On Monday, February 4, 2019, Wes McKinney  wrote:
> >
> > > Arrow support would be an obvious win for BigQuery. I've spoken with
> > > people at Google Cloud about this on several occasions.
> > >
> > > With the gRPC / Flight work coming along it might be a good
> > > opportunity to rekindle the discussion. If anyone from GCP is reading
> > > or if you know anyone at GCP who might be able to work with us I would
> > > be very interested.
> > >
> > > One hurdle for BigQuery is that my understanding is that Google has
> > > policies in place that make it more difficult to take on external
> > > library dependencies in a sensitive system like Dremel / BigQuery. So
> > > someone from Google might have to develop an in-house Arrow
> > > implementation sufficient to send Arrow datasets from BigQuery to
> > > clients. The scope of that project is small enough (requiring only
> > > Flatbuffers as a dependency) that a motivated C or C++ developer at
> > > Google ought to be able to get it done in a month or two of focused
> > > work.
> > >
> > > - Wes
> > >
> > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang 
> > > wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > I am currently working a lot with Google BigQuery in R and Python.
> > > > Hadley Wickham listed this as a big bottleneck for his library bigrquery.
> > > >
> > > > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
> > > > format, which is difficult to optimise further because I’m already using
> > > > the fastest C++ JSON parser, RapidJson. If this is still too slow (because
> > > > you download a lot of data), see ?bq_table_download for an alternative
> > > > approach.
> > > >
> > > > Is there any momentum for Arrow to partner with Google here?
> > > >
> > > > Thanks,
> > > >
> > > > Jonathan
> > > >
> > > >
> > > >
> > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney wrote:
> > > >>
> > > >> hi Jonathan,
> > > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> > > >> >
> > > >> > Hi Wes and Romain,
> > > >> >
> > > >> > I wrote a preliminary benchmark for reading and writing different
> > > >> > file types from R into Arrow, and borrowed some code from Hadley. I would
> > > >> > like some feedback to improve it and then possibly push an R/benchmarks
> > > >> > folder. I am willing to dedicate 

Re: [VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-22 Thread Kouhei Sutou
+1 (binding)

I ran the following on Debian GNU/Linux sid:

  * ARROW_HAVE_CUDA=no dev/release/verify-release-candidate.sh source 0.12.1 0
  * dev/release/verify-release-candidate.sh binaries 0.12.1 0

with:

  * gcc (Debian 8.2.0-21) 8.2.0
  * openjdk version "1.8.0_191"
  * ruby 2.7.0dev (2019-02-02 trunk 66993) [x86_64-linux]
  * Node.js v11.10.0


Thanks,
--
kou



[jira] [Created] (ARROW-4667) [C++] Unused function warnings with MinGW

2019-02-22 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4667:
---

 Summary: [C++] Unused function warnings with MinGW
 Key: ARROW-4667
 URL: https://issues.apache.org/jira/browse/ARROW-4667
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








Re: Support for Nested Types in Arrow Java

2019-02-22 Thread Li Jin
Hi Anoop,

Arrays and Structs are supported. I am not sure if anyone is working on map
type support.




Support for Nested Types in Arrow Java

2019-02-22 Thread Anoop Johnson
Hello Everyone --

New user of Arrow here. What is the current status of nested types in
Arrow?  https://jira.apache.org/jira/browse/ARROW-1279 indicates that
support for maps is in the Arrow spec, but the implementation is not yet
done. Is that correct? Is someone already working on this?

Other than maps, are there any limitations with nested types? For instance,
are deep nesting of types (arrays of structs etc) supported?

Thanks,
Anoop


Re: [VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-22 Thread Francois Saint-Jacques
+1 (non-binding)

* Validated sources on Ubuntu 18.04 with cmake 3.10.2
* Validated binaries



Question about pyarrow array representation.

2019-02-22 Thread peng yu
Hey channel,

I'm trying to fix issue 4350, which results from the arrow.Table <-> pandas.df
conversion not being symmetric.

Basically, we decided to use a numpy array as the basis of a list when
converting from an Arrow table to pandas, which makes me wonder why, since a
pandas df doesn't have a very good multi-dimensional array schema and a numpy
array is very strict about its compact memory layout. So it is very hard to
support anything more than 1-D array/list data inside a pandas df cell using a
numpy 1-D array of 1-D arrays. It has actually caused a lot of problems for us
when using the result of Arrow in Python land.

Would it be possible to just use a pure Python list if the data is more than
1-D?

I can't think of any easy fix for that issue. If any of you have any
suggestions, please let me know :)

Thanks!


[jira] [Created] (ARROW-4665) [C++] With glog activated, DCHECK macros are redefined

2019-02-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4665:
--

 Summary: [C++] With glog activated, DCHECK macros are redefined
 Key: ARROW-4665
 URL: https://issues.apache.org/jira/browse/ARROW-4665
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Uwe L. Korn
 Fix For: 0.13.0


When building with {{glog}}, I get errors that the {{DCHECK_*}} macros are 
redefined:

{code}
In file included from /arrow/cpp/src/arrow/util/logging.cc:27:
glog_ep-prefix/src/glog_ep/include/glog/logging.h:996: error: "DCHECK" redefined [-Werror]
 #define DCHECK(condition) CHECK(condition)

In file included from /arrow/cpp/src/arrow/util/logging.cc:18:
/arrow/cpp/src/arrow/util/logging.h:112: note: this is the location of the previous definition
 #define DCHECK(condition) ARROW_CHECK(condition)
{code}
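
To illustrate the clash, here is a minimal sketch. It is not Arrow's or glog's actual code, and it is not necessarily how this issue should be fixed. It assumes two headers each define a macro named DCHECK with a different body; the functions FirstCheck, SecondCheck and WhichCheckRan are hypothetical names used only for this example. Removing the earlier definition with #undef before redefining avoids the "redefined" diagnostic:

```cpp
#include <cassert>

// Hypothetical stand-ins for the check functions of two different libraries.
inline int FirstCheck(bool condition) { return condition ? 1 : -1; }
inline int SecondCheck(bool condition) { return condition ? 2 : -2; }

// "Header A" defines DCHECK first.
#define DCHECK(condition) FirstCheck(condition)

// "Header B" wants its own DCHECK. Redefining the macro directly with a
// different body triggers the "redefined" diagnostic (an error under -Werror),
// so the earlier definition is dropped first.
#ifdef DCHECK
#undef DCHECK
#endif
#define DCHECK(condition) SecondCheck(condition)

// After the guard, DCHECK expands to header B's definition.
inline int WhichCheckRan() { return DCHECK(true); }
```

Another option is to give each library's macros a unique prefix so the names never collide, at the cost of touching every call site.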





[jira] [Created] (ARROW-4664) [C++] DCHECK macro conditions are evaluated in release builds

2019-02-22 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-4664:


 Summary: [C++] DCHECK macro conditions are evaluated in release builds
 Key: ARROW-4664
 URL: https://issues.apache.org/jira/browse/ARROW-4664
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Benjamin Kietzman


{{DCHECK(potentially_expensive())}} will evaluate the argument even in release 
mode, and is used in several places with the assumption that it will do so 
(which means removing the guarantee of evaluation causes numerous failures). By 
contrast, most debug assertion macros (for example the standard {{assert}}) 
elide their arguments entirely in release mode.
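
The difference between the two behaviours can be sketched as follows. This is an illustration under stated assumptions, not Arrow's macro definitions: MY_DCHECK_EVALUATING and MY_DCHECK_ELIDING are hypothetical names for the two possible release-mode expansions, and the call counter exists only to make the side effect observable.

```cpp
#include <cassert>

// Hypothetical release-mode expansion A: the condition is still evaluated and
// its result is discarded, so side effects of the argument happen.
#define MY_DCHECK_EVALUATING(condition) ((void)(condition))

// Hypothetical release-mode expansion B: the condition is an unevaluated
// operand of sizeof, so it is type-checked but never executed.
#define MY_DCHECK_ELIDING(condition) ((void)sizeof(!(condition)))

// Counts how often the "expensive" predicate actually runs.
int g_calls = 0;
bool potentially_expensive() {
  ++g_calls;
  return true;
}

// Returns how many times the predicate ran under expansion A.
int CallsAfterEvaluating() {
  g_calls = 0;
  MY_DCHECK_EVALUATING(potentially_expensive());
  return g_calls;
}

// Returns how many times the predicate ran under expansion B.
int CallsAfterEliding() {
  g_calls = 0;
  MY_DCHECK_ELIDING(potentially_expensive());
  return g_calls;
}
```

Code that relies on the argument's side effects works under expansion A and silently stops doing that work under expansion B, which is why switching between the two can surface failures.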





[jira] [Created] (ARROW-4663) [Packaging] Conda-forge build misses gflags on linux

2019-02-22 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4663:
--

 Summary: [Packaging] Conda-forge build misses gflags on linux
 Key: ARROW-4663
 URL: https://issues.apache.org/jira/browse/ARROW-4663
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs


See build: https://travis-ci.org/kszucs/crossbow/builds/496958426





[jira] [Created] (ARROW-4662) [Python] Add type_codes property in UnionType

2019-02-22 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4662:
---

 Summary: [Python] Add type_codes property in UnionType
 Key: ARROW-4662
 URL: https://issues.apache.org/jira/browse/ARROW-4662
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Kenta Murata
Assignee: Kenta Murata








[jira] [Created] (ARROW-4660) [C++] gflags fails to build due to CMake error

2019-02-22 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4660:
-

 Summary: [C++] gflags fails to build due to CMake error
 Key: ARROW-4660
 URL: https://issues.apache.org/jira/browse/ARROW-4660
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Francois Saint-Jacques


gflags fails to build as a thirdparty download on Linux with CMake 3.10.2. 
Removing the line 
`target_compile_definitions(${GFLAGS_LIBRARY} INTERFACE "GFLAGS_IS_A_DLL=0")` 
makes it build without issue.
{code}
CMake Error at cmake_modules/ThirdpartyToolchain.cmake:658 (target_compile_definitions):
Cannot specify compile definitions for imported target "gflags_static".
Call Stack (most recent call first):
CMakeLists.txt:506 (include)
{code}





[jira] [Created] (ARROW-4661) [C++] Consolidate random string generators for use in benchmarks and unittests

2019-02-22 Thread Hatem Helal (JIRA)
Hatem Helal created ARROW-4661:
--

 Summary: [C++] Consolidate random string generators for use in benchmarks and unittests
 Key: ARROW-4661
 URL: https://issues.apache.org/jira/browse/ARROW-4661
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Hatem Helal
Assignee: Hatem Helal
 Fix For: 0.14.0


This was discussed here:

[https://github.com/apache/arrow/pull/3721]

For testing/benchmarking dictionary encoding, it's useful to control the number 
of repeated values, and it would also be good to optionally include null values.
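
One possible shape for such a generator is sketched below. It is a standalone illustration using only the standard library, not an existing Arrow API: the RandomStrings struct, the GenerateRandomStrings function and all of its parameters are hypothetical. Repeats are controlled by sampling from a fixed pool of num_unique strings, and nulls by a per-slot probability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical result type: values[i] is meaningful only when is_valid[i] is true.
struct RandomStrings {
  std::vector<std::string> values;
  std::vector<bool> is_valid;
};

// Hypothetical helper (not an Arrow API). Draws `length` strings from a pool of
// `num_unique` random lowercase strings of `string_size` characters each, and
// marks each slot null with probability `null_probability`.
// Requires num_unique >= 1. Pool entries may collide by chance, so the output
// holds at most `num_unique` distinct non-null values.
RandomStrings GenerateRandomStrings(std::size_t length, std::size_t num_unique,
                                    std::size_t string_size,
                                    double null_probability, std::uint32_t seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> letter(0, 25);

  // Build the pool of candidate values; repeats arise from sampling this pool.
  std::vector<std::string> pool(num_unique);
  for (auto& s : pool) {
    s.resize(string_size);
    for (auto& c : s) c = static_cast<char>('a' + letter(rng));
  }

  std::uniform_int_distribution<std::size_t> pick(0, num_unique - 1);
  std::bernoulli_distribution is_null(null_probability);

  RandomStrings out;
  out.values.reserve(length);
  out.is_valid.reserve(length);
  for (std::size_t i = 0; i < length; ++i) {
    const bool null_slot = is_null(rng);
    out.is_valid.push_back(!null_slot);
    // Null slots carry an empty placeholder string.
    out.values.push_back(null_slot ? std::string() : pool[pick(rng)]);
  }
  return out;
}
```

A small num_unique relative to length gives heavily repeated data, which is the interesting case for dictionary encoding; taking an explicit seed keeps benchmark inputs reproducible.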





[jira] [Created] (ARROW-4659) [CI] ubuntu/debian nightlies fail because of missing gandiva files

2019-02-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4659:
--

 Summary: [CI] ubuntu/debian nightlies fail because of missing gandiva files
 Key: ARROW-4659
 URL: https://issues.apache.org/jira/browse/ARROW-4659
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration, Packaging
Reporter: Uwe L. Korn
 Fix For: 0.13.0


The nightly jobs fail with
{code:java}
   dh_install

dh_install: libgandiva13 missing files: usr/lib/*/gandiva/
dh_install: missing files, aborting
debian/rules:14: recipe for target 'binary' failed
make: *** [binary] Error 2
dpkg-buildpackage: error: fakeroot debian/rules binary gave error exit status 2
{code}
[~kou] [~kszucs] Is this because we now ship the precompiled code inside of the 
binary library?





[jira] [Created] (ARROW-4658) [C++] shared gflags is also a run-time conda requirement

2019-02-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4658:
--

 Summary: [C++] shared gflags is also a run-time conda requirement
 Key: ARROW-4658
 URL: https://issues.apache.org/jira/browse/ARROW-4658
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0








Re: [VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-22 Thread Uwe L. Korn
+1 (binding)

* Checked sources on Ubuntu 16.04 with an updated CMake and Gandiva turned off.
* Verified the uploaded signatures of sources and binaries.



[VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-22 Thread Uwe L. Korn
Hi,

I'd like to propose the first voteable release candidate (RC0) of Apache
Arrow version 0.12.1. This is a minor release consisting of 14 resolved JIRAs
[1].

This release candidate is based on commit:
ba09a9e93dc28da629f63e101e231c8b8df942d3 [2]

The source release rc0 is hosted at [3].
The binary artifacts are hosted at [4].
The changelog is located at [5].

Please download, verify checksums and signatures, run the unit tests,
and vote on the release. See [6] for how to validate a release candidate.
Please use the verification script from master, because it required
a patch to work after the recent conda-forge compiler migration [7].

The vote will be open for at least 72 hours.

[ ] +1 Release this as Apache Arrow 0.12.1
[ ] +0
[ ] -1 Do not release this as Apache Arrow 0.12.1 because...

- Uwe

[1]:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.1
[2]: 
https://github.com/apache/arrow/commit/ba09a9e93dc28da629f63e101e231c8b8df942d3
[3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.12.1-rc0/
[4]: https://bintray.com/apache/arrow

[5]: 
https://github.com/apache/arrow/blob/ba09a9e93dc28da629f63e101e231c8b8df942d3/CHANGELOG.md
[6]:
https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates
[7]: https://github.com/apache/arrow/pull/3413


[jira] [Created] (ARROW-4657) [Release] gbenchmark should not be needed for verification

2019-02-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4657:
--

 Summary: [Release] gbenchmark should not be needed for verification
 Key: ARROW-4657
 URL: https://issues.apache.org/jira/browse/ARROW-4657
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Packaging
Affects Versions: 0.12.0, 0.12.1
Reporter: Uwe L. Korn
 Fix For: 0.13.0


{{gbenchmark}} is built during verification, and thus we require a minimum 
CMake version of 3.6. I would have guessed that we should not require it, as we 
do not need to build the benchmarks during verification. I guess that a 
recent fix from [~wesmckinn] may have fixed this, but we should verify this 
before doing the next release.


