[jira] [Resolved] (ARROW-6070) [Java] Avoid creating new schema before IPC sending

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6070.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 4968
[https://github.com/apache/arrow/pull/4968]

> [Java] Avoid creating new schema before IPC sending
> ---
>
> Key: ARROW-6070
> URL: https://issues.apache.org/jira/browse/ARROW-6070
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> If a dictionary is attached to a schema, it may need to be converted before 
> being sent over IPC. When this is not the case (which is the most likely case 
> in practice), there is no need to do the conversion and no need to create a 
> new schema. We solve this by quickly determining whether conversion is 
> required, and if not, we avoid creating a new schema and return the original 
> one immediately.
>  
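A minimal sketch of the fast path described above, transliterated to Python (the actual change is in Java; `needs_conversion` and `convert` are hypothetical helpers standing in for the real conversion logic):

{code:python}
# Hypothetical sketch: only build a converted schema when some field
# actually requires dictionary conversion; otherwise reuse the original.
def schema_for_ipc(schema, needs_conversion, convert):
    if not any(needs_conversion(field) for field in schema):
        return schema            # fast path: no new schema allocated
    return convert(schema)       # slow path: build the converted copy
{code}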



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6300) [C++] Add io::OutputStream::Abort()

2019-08-21 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912950#comment-16912950
 ] 

Micah Kornfield commented on ARROW-6300:


It sounds like it makes sense. I think not just hiding it in the destructor 
makes sense (generally, I thought it was frowned upon to do work that could fail 
in the destructor anyway).

 

>> 2. For regular file systems, would it make sense to delete the underlying 
>> file (they both seem to be about clearing underlying resources?)

>I don't think so. The aim is to clear any temporary runtime resources 
>associated to the stream. We don't claim that Abort() will rollback any 
>persistent changes.

Are we writing temporary files and then moving them (i.e. letting the OS 
cleanup the temporary file)?

> [C++] Add io::OutputStream::Abort()
> ---
>
> Key: ARROW-6300
> URL: https://issues.apache.org/jira/browse/ARROW-6300
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> This method would abort the current output stream without trying to flush or 
> commit any pending internal data. This makes sense mostly for buffered 
> streams. For other streams it could simply be synonymous with Close().
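A rough sketch of the proposed semantics, in Python (class and method names are illustrative, not the C++ API): Abort() discards buffered bytes instead of flushing them, then releases the underlying resource.

{code:python}
class BufferedOutputStream:
    def __init__(self, raw):
        self.raw = raw
        self.buffer = bytearray()

    def write(self, data):
        self.buffer += data              # data only reaches `raw` on close()

    def close(self):
        # Normal path: flush pending bytes, then close.
        self.raw.write(bytes(self.buffer))
        self.raw.close()

    def abort(self):
        # Proposed path: drop pending bytes without committing them.
        self.buffer.clear()
        self.raw.close()
{code}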



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6260) [Website] Use deploy key on Travis to build and push to asf-site

2019-08-21 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei resolved ARROW-6260.
-
Resolution: Fixed

> [Website] Use deploy key on Travis to build and push to asf-site
> 
>
> Key: ARROW-6260
> URL: https://issues.apache.org/jira/browse/ARROW-6260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> ARROW-4473 added CI/CD for the website, but there was some discomfort about 
> having a committer provide a GitHub personal access token to do the pushing 
> of the built site to the asf-site branch. Investigate using GitHub Deploy 
> Keys instead, which are scoped to a single repository, not all public 
> repositories that a user has access to.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6260) [Website] Use deploy key on Travis to build and push to asf-site

2019-08-21 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei updated ARROW-6260:

Fix Version/s: 0.15.0

> [Website] Use deploy key on Travis to build and push to asf-site
> 
>
> Key: ARROW-6260
> URL: https://issues.apache.org/jira/browse/ARROW-6260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> ARROW-4473 added CI/CD for the website, but there was some discomfort about 
> having a committer provide a GitHub personal access token to do the pushing 
> of the built site to the asf-site branch. Investigate using GitHub Deploy 
> Keys instead, which are scoped to a single repository, not all public 
> repositories that a user has access to.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-2255) [Developer][Integration] Serialize schema- and field-level custom metadata in integration test JSON format

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-2255:
--

Assignee: Micah Kornfield

> [Developer][Integration] Serialize schema- and field-level custom metadata in 
> integration test JSON format
> --
>
> Key: ARROW-2255
> URL: https://issues.apache.org/jira/browse/ARROW-2255
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: 1.0.0
>
>
> I don't believe we are doing this at present. We should validate that each 
> implementation properly handles the incoming metadata from other Arrow 
> emitters.
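As a concrete illustration, the integration JSON could carry key/value metadata at both the schema and field level, roughly like this (a hedged sketch; the exact key layout in the integration format is an assumption, not a spec quote):

{code:python}
# Hypothetical shape of schema- and field-level custom metadata in the
# integration-test JSON; key names are illustrative.
schema_json = {
    "fields": [
        {"name": "f0",
         "type": {"name": "int", "bitWidth": 32, "isSigned": True},
         "nullable": True, "children": [],
         "metadata": [{"key": "field_key", "value": "field_value"}]},
    ],
    "metadata": [{"key": "schema_key", "value": "schema_value"}],
}
{code}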



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-2255) [Developer][Integration] Serialize schema- and field-level custom metadata in integration test JSON format

2019-08-21 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912936#comment-16912936
 ] 

Micah Kornfield commented on ARROW-2255:


agreed.  I'll try to make this happen for 1.0.0.

> [Developer][Integration] Serialize schema- and field-level custom metadata in 
> integration test JSON format
> --
>
> Key: ARROW-2255
> URL: https://issues.apache.org/jira/browse/ARROW-2255
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I don't believe we are doing this at present. We should validate that each 
> implementation properly handles the incoming metadata from other Arrow 
> emitters.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-2303) [C++] Disable ASAN when building io-hdfs-test.cc

2019-08-21 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912934#comment-16912934
 ] 

Micah Kornfield commented on ARROW-2303:


I think there is potentially a black-list.  Are there steps to reproduce (i.e. 
is there anything other than turning on HDFS and ASAN in the build necessary to 
reproduce this?)

> [C++] Disable ASAN when building io-hdfs-test.cc
> 
>
> Key: ARROW-2303
> URL: https://issues.apache.org/jira/browse/ARROW-2303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> ASAN reports spurious memory leaks in this unit test module. I am not sure 
> the easiest way to conditionally scrub the ASAN flags from such a unit test's 
> compilation flags



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6319) [C++] Extract the core of NumericTensor::Value as Tensor::Value

2019-08-21 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6319:
---

 Summary: [C++] Extract the core of NumericTensor::Value as 
Tensor::Value
 Key: ARROW-6319
 URL: https://issues.apache.org/jira/browse/ARROW-6319
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I'd like to enable element-wise access in Tensor class.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4279) [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo

2019-08-21 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912926#comment-16912926
 ] 

Micah Kornfield commented on ARROW-4279:


agreed.

> [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo
> 
>
> Key: ARROW-4279
> URL: https://issues.apache.org/jira/browse/ARROW-4279
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> The old commit needs to be changed to be a PR against the arrow repo and not 
> parquet-cpp.
> Changes needed as part of this:
> 1.  Allow for running both old and new code path until performance regression 
> can be eliminated.
> 2.  Instead of passing through nthreads consider using util/task-group from 
> arrow as a parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4279) [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-4279.

Resolution: Won't Fix

Other JIRAs track the issue, and rebasing is no longer practical.

> [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo
> 
>
> Key: ARROW-4279
> URL: https://issues.apache.org/jira/browse/ARROW-4279
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> The old commit needs to be changed to be a PR against the arrow repo and not 
> parquet-cpp.
> Changes needed as part of this:
> 1.  Allow for running both old and new code path until performance regression 
> can be eliminated.
> 2.  Instead of passing through nthreads consider using util/task-group from 
> arrow as a parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6318) [Integration] Update integration test to use generated binaries to ensure backwards compatibility

2019-08-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6318:
--

 Summary: [Integration] Update integration test to use generated 
binaries to ensure backwards compatibility
 Key: ARROW-6318
 URL: https://issues.apache.org/jira/browse/ARROW-6318
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Micah Kornfield
 Fix For: 0.15.0


Generate stream/file data and check it in to the testing package.  Update the 
integration script to have additional tests that run against the pregenerated 
artifacts.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6317) [JS] Implement changes to ensure flatbuffer alignment

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6317:
---
Summary: [JS] Implement changes to ensure flatbuffer alignment  (was: 
[Javascript])

> [JS] Implement changes to ensure flatbuffer alignment
> -
>
> Key: ARROW-6317
> URL: https://issues.apache.org/jira/browse/ARROW-6317
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: JavaScript
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>
> See the description in the parent bug for requirements.
> [~bhulette] or [~paul.e.taylor] do you think one of you would be able to pick 
> this up for 0.15.0?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6314) [C++] Implement changes to ensure flatbuffer alignment.

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6314:
---
Summary: [C++] Implement changes to ensure flatbuffer alignment.  (was: 
[C++] Implement alignment to ensure flatbuffer alignemnt.)

> [C++] Implement changes to ensure flatbuffer alignment.
> ---
>
> Key: ARROW-6314
> URL: https://issues.apache.org/jira/browse/ARROW-6314
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6317) [Javascript]

2019-08-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6317:
--

 Summary: [Javascript]
 Key: ARROW-6317
 URL: https://issues.apache.org/jira/browse/ARROW-6317
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: JavaScript
Reporter: Micah Kornfield
 Fix For: 0.15.0


See the description in the parent bug for requirements.

[~bhulette] or [~paul.e.taylor] do you think one of you would be able to pick 
this up for 0.15.0?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6316) [Go] Make change to ensure flatbuffer reads are aligned

2019-08-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6316:
--

 Summary: [Go]  Make change to ensure flatbuffer reads are aligned
 Key: ARROW-6316
 URL: https://issues.apache.org/jira/browse/ARROW-6316
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Go
Reporter: Micah Kornfield
 Fix For: 0.15.0


See parent task for requirements.  [~sbinet] do you think you will be able to 
work on this?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6314) [C++] Implement alignment to ensure flatbuffer alignemnt.

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6314:
---
Fix Version/s: 0.15.0

> [C++] Implement alignment to ensure flatbuffer alignemnt.
> -
>
> Key: ARROW-6314
> URL: https://issues.apache.org/jira/browse/ARROW-6314
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6315) [Java] Make change to ensure flatbuffer reads are aligned

2019-08-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6315:
--

 Summary: [Java] Make change to ensure flatbuffer reads are aligned 
 Key: ARROW-6315
 URL: https://issues.apache.org/jira/browse/ARROW-6315
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Micah Kornfield
 Fix For: 0.15.0


See parent bug for details on requirements.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6314) [C++] Implement alignment to ensure flatbuffer alignemnt.

2019-08-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6314:
--

 Summary: [C++] Implement alignment to ensure flatbuffer alignemnt.
 Key: ARROW-6314
 URL: https://issues.apache.org/jira/browse/ARROW-6314
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Micah Kornfield
Assignee: Wes McKinney






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6313) Tracking for ensuring flatbuffer serialized values are aligned in stream/files.

2019-08-21 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6313:
---
Summary: Tracking for ensuring flatbuffer serialized values are aligned in 
stream/files.  (was: Tracking)

> Tracking for ensuring flatbuffer serialized values are aligned in 
> stream/files.
> ---
>
> Key: ARROW-6313
> URL: https://issues.apache.org/jira/browse/ARROW-6313
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>
> Overall tracking bug for the implementation of the IPC/file format change 
> proposed by: [https://github.com/apache/arrow/pull/4951/files]
>  
> Implementations must support backwards compatibility with the old format (and 
> ideally do memcopies when required to avoid undefined behavior).  Having a 
> backwards-compatible write mode is optional.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6313) Tracking

2019-08-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6313:
--

 Summary: Tracking
 Key: ARROW-6313
 URL: https://issues.apache.org/jira/browse/ARROW-6313
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield
 Fix For: 0.15.0


Overall tracking bug for the implementation of the IPC/file format change 
proposed by: [https://github.com/apache/arrow/pull/4951/files]

 

Implementations must support backwards compatibility with the old format (and 
ideally do memcopies when required to avoid undefined behavior).  Having a 
backwards-compatible write mode is optional.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6302) [Python][Parquet] Reading dictionary type with serialized Arrow schema does not restore "ordered" type property

2019-08-21 Thread Galuh Sahid (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912900#comment-16912900
 ] 

Galuh Sahid commented on ARROW-6302:


I'd like to attempt this if it's OK. 

[~jorisvandenbossche] I want to make sure: is 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader_internal.cc
 the only file that needs to be changed, or are there other files that need to 
be changed as well? 

> [Python][Parquet] Reading dictionary type with serialized Arrow schema does 
> not restore "ordered" type property
> ---
>
> Key: ARROW-6302
> URL: https://issues.apache.org/jira/browse/ARROW-6302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
>Reporter: Galuh Sahid
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> In pandas, I tried roundtripping to parquet with {{to_parquet}} and 
> {{read_parquet}}. It preserves categorical dtypes but does not preserve their 
> order.
> {code:python}
> import pandas as pd
> from pandas.io.parquet import read_parquet, to_parquet
> df = pd.DataFrame()
> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], 
> ordered=True)
> df.to_parquet("df.parquet")  # illustrative path added so the snippet runs
> actual = read_parquet("df.parquet")
> df["a"]
> 0NaN
> 1  b
> 2  c
> 3NaN
> Name: a, dtype: category
> Categories (3, object): [b < c < d]
> actual["a"]
> 0NaN
> 1  b
> 2  c
> 3NaN
> Name: a, dtype: category
> Categories (3, object): [b, c, d]
> {code}
>  
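One way to observe the loss directly is to inspect the restored Arrow schema; a hedged sketch using present-day pyarrow accessor names (treat the exact attributes as assumptions), continuing from the repro above:

{code:python}
import pyarrow.parquet as pq

# The restored dictionary type should report ordered=True; per this
# issue it currently comes back as ordered=False.
schema = pq.read_table("df.parquet").schema
print(schema.field("a").type.ordered)
{code}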



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4848) [C++] Static libparquet not compiled with -DARROW_STATIC on Windows

2019-08-21 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912883#comment-16912883
 ] 

Neal Richardson commented on ARROW-4848:


FTR this is the cmake command we use for the Windows C++ build for R, and it 
seems to work: [https://github.com/apache/arrow/blob/master/ci/PKGBUILD#L77-L97]

> [C++] Static libparquet not compiled with -DARROW_STATIC on Windows
> ---
>
> Key: ARROW-4848
> URL: https://issues.apache.org/jira/browse/ARROW-4848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>
> When trying to link the R bindings against static libparquet.a + libarrow.a 
> we get a lot of missing arrow symbol warnings from libparquet.a. I think the 
> problem is that libparquet.a was not compiled with -DARROW_STATIC, and 
> therefore cannot be linked against libarrow.a.
> When arrow cmake is configured with  -DARROW_BUILD_SHARED=OFF I think it 
> should automatically use -DARROW_STATIC when compiling libparquet on Windows?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-4427) Move Confluence Wiki pages to the Sphinx docs

2019-08-21 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-4427:
--

Assignee: Neal Richardson

> Move Confluence Wiki pages to the Sphinx docs
> -
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Assignee: Neal Richardson
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  and other developers' wiki pages in Confluence. If these were moved inside 
> the project web page, that would make it easier.
> There are 5 steps to this:
>  # Create a new directory inside of `arrow/docs/source` to house the wiki 
> pages. (It will look like the 
> [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or 
> [python|https://github.com/apache/arrow/tree/master/docs/source/python] 
> directories.)
>  # Copy the wiki page contents to new `*.rst` pages inside this new directory.
>  # Add an `index.rst` that links to them all with enough description to help 
> navigation.
>  # Modify the Sphinx index page 
> [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst]
>  to have an entry that points to the new index page made in step 3
>  # Modify the static site page 
> [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3203) [C++] Build error on Debian Buster

2019-08-21 Thread Sutou Kouhei (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912822#comment-16912822
 ] 

Sutou Kouhei commented on ARROW-3203:
-

Yes. This is outdated.

> [C++] Build error on Debian Buster
> --
>
> Key: ARROW-3203
> URL: https://issues.apache.org/jira/browse/ARROW-3203
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.10.0
>Reporter: albertoramon
>Priority: Major
> Attachments: DockerfileRV, flatbuffers_ep-build-err.log
>
>
> There is an error with Debian Buster (in Debian Stretch it works fine).
> You can test it easily by changing the first line of the Dockerfile (attached).
>  
> *To reproduce it:*
> {code:java}
> docker build -f DockerfileRV -t arrow_rw .
> docker run -it arrow_rw bash
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4967) [C++] Parquet: Object type and stats lost when using 96-bit timestamps

2019-08-21 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912811#comment-16912811
 ] 

Deepak Majeti commented on ARROW-4967:
--

The comments above are correct! The INT96 type is deprecated and its statistics 
are disabled by default. The timestamp byte layout in INT96 is big endian and 
does not comply with the standard sort orders in the spec.

> [C++] Parquet: Object type and stats lost when using 96-bit timestamps
> --
>
> Key: ARROW-4967
> URL: https://issues.apache.org/jira/browse/ARROW-4967
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
>Reporter: Diego Argueta
>Priority: Minor
>  Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes 
> an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema 
> 
> foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4 
> 
> foo:  INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 
> 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, it appears that that metadata is lost. 
> No object type, and no column stats.
> {code}
> file schema: schema 
> 
> foo: OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4 
> 
> foo:  INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing since the metadata for the exact same data can look 
> differently depending on an unrelated flag being set or cleared.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5932) [C++] undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912808#comment-16912808
 ] 

Wes McKinney commented on ARROW-5932:
-

This usually suggests you have multiple libstdc++ versions on your system and 
there is a conflict. If you can provide more details about your build 
environment we can try to help you. We have Dockerfiles that use 18.04 to build 
the project and there seems to be no issue there, so it is likely a problem 
with your environment.

> [C++] undefined reference to `__cxa_init_primary_exception@CXXABI_1.3.11'
> -
>
> Key: ARROW-5932
> URL: https://issues.apache.org/jira/browse/ARROW-5932
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
> Environment: Linux Mint 19.1 Tessa
> g++-6
>Reporter: Cong Ding
>Priority: Critical
>
> I was installing Apache Arrow on my Linux Mint 19.1 Tessa server. I followed 
> the instructions on the official arrow website (using the ubuntu 18.04 
> method). However, when I was trying to compile the examples, the g++ compiler 
> threw out some errors.
> I have updated my g++ to g++-6, updated my libstdc++ library, and used the 
> -lstdc++ flag, but it still didn't work.
>  
> {code:java}
> // code placeholder
> g++-6 -std=c++11 -larrow -lparquet main.cpp -lstdc++ 
> {code}
> The error message:
> /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
> `__cxa_init_primary_exception@CXXABI_1.3.11'
> /usr/lib/x86_64-linux-gnu/libarrow.so: undefined reference to 
> `std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11'
> collect2: error: ld returned 1 exit status.
>  
> I do not know what to do at this moment. Can anyone help me?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-2681) [C++] Use source releases when building ORC instead of using GitHub tag snapshots

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912805#comment-16912805
 ] 

Wes McKinney commented on ARROW-2681:
-

I guess it's not so weird since we remove our old releases from the mirrors, 
too. 

> [C++] Use source releases when building ORC instead of using GitHub tag 
> snapshots
> -
>
> Key: ARROW-2681
> URL: https://issues.apache.org/jira/browse/ARROW-2681
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See related discussion in ORC-374. It would be better to use the release 
> artifacts that have been voted on by the ORC PMC.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-2882) [C++][Python] Support AWS Firehose partition_scheme implementation for Parquet datasets

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2882:

Labels: dataset parquet  (was: dataset datasets parquet)

> [C++][Python] Support AWS Firehose partition_scheme implementation for 
> Parquet datasets
> ---
>
> Key: ARROW-2882
> URL: https://issues.apache.org/jira/browse/ARROW-2882
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Pablo Javier Takara
>Priority: Major
>  Labels: dataset, parquet
>
> I'd like to be able to read a ParquetDataset generated by AWS Firehose.
> The only implementation at the time of writing was the partition scheme 
> created by Hive (year=2018/month=01/day=11).
> The AWS Firehose partition scheme is a little bit different (2018/01/11).
>  
> Thanks
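For illustration, a Firehose-style path carries values only, so mapping it to partition keys needs an externally supplied field order. A hedged Python sketch (the function and default field names are hypothetical):

{code:python}
# Hypothetical sketch: a Hive path carries key=value pairs, a Firehose
# path carries bare values, so the field names must be supplied.
def firehose_partition(path, fields=("year", "month", "day")):
    values = path.strip("/").split("/")[:len(fields)]
    return dict(zip(fields, values))

firehose_partition("2018/01/11")  # {'year': '2018', 'month': '01', 'day': '11'}
{code}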



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-1089) [C++/Python] Add API to write an Arrow stream into either the stream or file formats on disk

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1089:

Labels: dataset  (was: dataset datasets)

> [C++/Python] Add API to write an Arrow stream into either the stream or file 
> formats on disk
> 
>
> Key: ARROW-1089
> URL: https://issues.apache.org/jira/browse/ARROW-1089
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
> Fix For: 1.0.0
>
>
> For Arrow streams with unknown size, it would be useful to be able to write 
> the data to disk either as a stream or as the file format (for random access) 
> with minimal overhead; i.e. we would avoid record batch IPC loading and write 
> the raw messages directly to disk.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-2366) [Python][C++][Parquet] Support reading Parquet files having a permutation of column order

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2366:

Labels: dataset parquet  (was: dataset datasets parquet)

> [Python][C++][Parquet] Support reading Parquet files having a permutation of 
> column order
> -
>
> Key: ARROW-2366
> URL: https://issues.apache.org/jira/browse/ARROW-2366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 1.0.0
>
>
> See discussion in https://github.com/dask/fastparquet/issues/320



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6244) [C++] Implement Partition DataSource

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6244:

Labels: dataset  (was: dataset datasets)

> [C++] Implement Partition DataSource
> 
>
> Key: ARROW-6244
> URL: https://issues.apache.org/jira/browse/ARROW-6244
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> This is a DataSource that also has partition metadata. The end goal is to 
> support filtering with a DataSelector/Filter expression. The initial 
> implementation should not deal with PartitionScheme yet.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6238) [C++] Implement SimpleDataSource/SimpleDataFragment

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6238:

Labels: dataset pull-request-available  (was: dataset datasets 
pull-request-available)

> [C++] Implement SimpleDataSource/SimpleDataFragment
> ---
>
> Key: ARROW-6238
> URL: https://issues.apache.org/jira/browse/ARROW-6238
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3379) [C++] Implement regex/multichar delimiter tokenizer

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3379:

Labels: csv dataset  (was: csv dataset datasets)

> [C++] Implement regex/multichar delimiter tokenizer
> ---
>
> Key: ARROW-3379
> URL: https://issues.apache.org/jira/browse/ARROW-3379
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, dataset
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6243) [C++] Implement basic Filter expression classes

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6243:

Labels: dataset pull-request-available  (was: dataset datasets 
pull-request-available)

> [C++] Implement basic Filter expression classes
> ---
>
> Key: ARROW-6243
> URL: https://issues.apache.org/jira/browse/ARROW-6243
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This will draft the basic classes for creating boolean expressions that are 
> passed to the DataSources/DataFragments for predicate push-down.
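A hedged Python sketch of the kind of expression tree this describes (class names are illustrative, not the C++ API):

{code:python}
from dataclasses import dataclass

@dataclass
class Field:          # reference to a column in the data source
    name: str

@dataclass
class Greater:        # comparison node
    left: Field
    right: object

@dataclass
class And:            # boolean conjunction node
    left: object
    right: object

# An expression a DataSource could use to skip fragments entirely:
expr = And(Greater(Field("year"), 2017), Greater(Field("month"), 6))
{code}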



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4076) [Python] schema validation and filters

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4076:

Labels: dataset easyfix parquet pull-request-available  (was: dataset 
datasets easyfix parquet pull-request-available)

> [Python] schema validation and filters
> --
>
> Key: ARROW-4076
> URL: https://issues.apache.org/jira/browse/ARROW-4076
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: George Sakkis
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, easyfix, parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently [schema 
> validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900]
>  of {{ParquetDataset}} takes place before filtering. This may raise a 
> {{ValueError}} if the schema is different in some dataset pieces, even if 
> these pieces would be subsequently filtered out. I think validation should 
> happen after filtering to prevent such spurious errors:
> {noformat}
> --- a/pyarrow/parquet.py  
> +++ b/pyarrow/parquet.py  
> @@ -878,13 +878,13 @@
>  if split_row_groups:
>  raise NotImplementedError("split_row_groups not yet implemented")
>  
> -if validate_schema:
> -self.validate_schemas()
> -
>  if filters is not None:
>  filters = _check_filters(filters)
>  self._filter(filters)
>  
> +if validate_schema:
> +self.validate_schemas()
> +
>  def validate_schemas(self):
>  open_file = self._get_open_file_func()
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3538:

Labels: dataset features parquet pull-request-available  (was: dataset 
datasets features parquet pull-request-available)

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Assignee: Thomas Elvey
>Priority: Major
>  Labels: dataset, features, parquet, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as 
> dataset using pyarrow parquet, I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look something like this:
> {noformat}
> some_path
> ├── a=1
> │   └── 4498704937d84fe5abebb3f06515ab2d.parquet
> ├── a=2
> │   └── 8bcfaed8986c4bdba587aaaee532370c.parquet
> {noformat}
> *Wished Feature:* It'd be great if I could override the auto-assignment of the 
> long UUID as the filename somehow during the *dataset* writing. My purpose is 
> to be able to overwrite the dataset on disk when I have a new version of 
> {{df}}. Currently if I try to write the dataset again, another new uniquely 
> named [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.
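A hedged sketch of the requested behavior, continuing from the snippet above (`table` and `some_path` as defined there): a user-supplied filename callback makes rewrites deterministic. The `partition_filename_cb` parameter reflects the direction of the linked pull request; treat the exact name as an assumption.

{code:python}
import pyarrow.parquet as pq

# Hypothetical: the callback maps partition key values to a stable
# filename, so rewriting the dataset overwrites the old file instead
# of adding another [UUID].parquet beside it.
pq.write_to_dataset(
    table,
    root_path=some_path,
    partition_cols=["a"],
    partition_filename_cb=lambda keys: "part-{}.parquet".format(
        "-".join(str(k) for k in keys)),
)
{code}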



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6242) [C++] Implements basic Dataset/Scanner/ScannerBuilder

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6242:

Labels: dataset  (was: dataset datasets)

> [C++] Implements basic Dataset/Scanner/ScannerBuilder
> -
>
> Key: ARROW-6242
> URL: https://issues.apache.org/jira/browse/ARROW-6242
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> The goal of this would be to iterate over a Dataset and generate a 
> "flattened" stream of RecordBatches from the union of data sources and data 
> fragments. This should not bother with filtering yet.
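A hedged Python sketch of the flattened iteration (attribute and method names are illustrative, not the C++ API):

{code:python}
# Yield RecordBatches from every fragment of every source as one
# continuous stream; filtering is deliberately out of scope here.
def scan(dataset):
    for source in dataset.sources:
        for fragment in source.fragments():
            yield from fragment.to_batches()
{code}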



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6161) [C++] Implements dataset::ParquetFile and associated Scan structures

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6161:

Labels: dataset pull-request-available  (was: dataset datasets 
pull-request-available)

> [C++] Implements dataset::ParquetFile and associated Scan structures
> 
>
> Key: ARROW-6161
> URL: https://issues.apache.org/jira/browse/ARROW-6161
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> This is the first baby step in supporting datasets. The initial implementation 
> will be minimal and trivial: no parallelism, no schema adaptation.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6299) [C++] Simplify FileFormat classes to singletons

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6299:

Labels: dataset  (was: dataset datasets)

> [C++] Simplify FileFormat classes to singletons
> ---
>
> Key: ARROW-6299
> URL: https://issues.apache.org/jira/browse/ARROW-6299
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Assignee: Benjamin Kietzman
>Priority: Minor
>  Labels: dataset
>
> ParquetFileFormat has no state, so passing it around by 
> shared_ptr is not necessary; we could just keep a single static 
> instance and pass raw pointers.
> [~wesmckinn] is there a case where a FileFormat might have state?
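For illustration, the singleton idea transliterated to Python (a hedged sketch; in the C++ proposal this would be a static instance handed around as a raw pointer):

{code:python}
class ParquetFileFormat:
    """Stateless: every instance behaves identically."""
    def open(self, path):
        raise NotImplementedError

# One shared instance replaces per-call-site shared_ptr copies.
PARQUET_FILE_FORMAT = ParquetFileFormat()
{code}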



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3408:

Labels: csv dataset  (was: csv dataset datasets)

> [C++] Add option to CSV reader to dictionary encode individual columns or all 
> string / binary columns
> -
>
> Key: ARROW-3408
> URL: https://issues.apache.org/jira/browse/ARROW-3408
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, dataset
> Fix For: 1.0.0
>
>
> For many datasets, dictionary encoding everything can result in drastically 
> lower memory usage and subsequently better performance in doing analytics.
> One difficulty of dictionary encoding in multithreaded conversions is that 
> ideally you end up with one dictionary at the end. So you have two options:
> * Implement a concurrent hashing scheme -- for low cardinality dictionaries, 
> the overhead associated with mutex contention will not be meaningful; for 
> high cardinality it can be more of a problem
> * Hash each chunk separately, then normalize at the end
> My guess is that a crude concurrent hash table with a mutex to protect 
> mutations and resizes is going to outperform the latter.
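A hedged Python sketch of the second option, hash-per-chunk then normalize (helper names are illustrative):

{code:python}
def encode_chunk(values):
    # Dictionary-encode one chunk with its own local hash table.
    dictionary, indices = {}, []
    for v in values:
        indices.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), indices

def normalize(chunks):
    # Merge per-chunk dictionaries into one, remapping each chunk's
    # local indices to positions in the merged dictionary.
    merged, remapped = {}, []
    for dictionary, indices in chunks:
        mapping = [merged.setdefault(v, len(merged)) for v in dictionary]
        remapped.append([mapping[i] for i in indices])
    return list(merged), remapped
{code}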



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4967) [C++] Parquet: Object type and stats lost when using 96-bit timestamps

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4967.
---
Resolution: Won't Fix

Computation of statistics is disabled for INT96. We don't intend to do anything 
about this AFAIK. cc [~mdeepak]

> [C++] Parquet: Object type and stats lost when using 96-bit timestamps
> --
>
> Key: ARROW-4967
> URL: https://issues.apache.org/jira/browse/ARROW-4967
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.12.1
> Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
>Reporter: Diego Argueta
>Priority: Minor
>  Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes 
> an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema 
> 
> foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4 
> 
> foo:  INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 
> 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, it appears that that metadata is lost. 
> No object type, and no column stats.
> {code}
> file schema: schema 
> 
> foo: OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4 
> 
> foo:  INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing since the metadata for the exact same data can look 
> differently depending on an unrelated flag being set or cleared.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4930:

Summary: [Python] Remove LIBDIR assumptions in Python build  (was: Remove 
LIBDIR assumptions in Python build)

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location of the C++ libraries. 
> Removing this hard assumption will simplify PyArrow builds significantly. As 
> far as I could tell these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries is 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4966) [C++] orc::TimezoneError Can't open /usr/share/zoneinfo/GMT-00:00

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4966:

Summary: [C++] orc::TimezoneError Can't open /usr/share/zoneinfo/GMT-00:00  
(was: orc::TimezoneError Can't open /usr/share/zoneinfo/GMT-00:00)

> [C++] orc::TimezoneError Can't open /usr/share/zoneinfo/GMT-00:00
> -
>
> Key: ARROW-4966
> URL: https://issues.apache.org/jira/browse/ARROW-4966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.0
>Reporter: Peter Wicks
>Priority: Major
>
> When reading some ORC files, pyarrow orc throws the following error on 
> `read()`: 
> {code:java}
> o = pf.read(){code}
> {{terminate called after throwing an instance of 'orc::TimezoneError'}}
>  {{what(): Can't open /usr/share/zoneinfo/GMT-00:00}}
> While it's true this folder does not exist, I don't think it normally does. 
> Our server has folders for `GMT`, `GMT0`, `GMT-0`, and `GMT+0`.
> The ORC file was created using Hive, compressed with Snappy. Other files from 
> the same table/partition do not throw this error. The files can be read with Hive.
> We created a soft link from the existing `GMT` timezone to this one, and it 
> fixed the issue. Then shortly I got the same error, but for `GMT+00:00`... :D 
> Soft link fixed this one also.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4880) [Python] python/asv-build.sh is probably broken after CMake refactor

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4880:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] python/asv-build.sh is probably broken after CMake refactor
> 
>
> Key: ARROW-4880
> URL: https://issues.apache.org/jira/browse/ARROW-4880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> uses {{$ARROW_BUILD_TOOLCHAIN}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4860) [C++] Build AWS C++ SDK for Windows in conda-forge

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4860.
---
Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Appears this was done in 
https://github.com/conda-forge/aws-sdk-cpp-feedstock/pull/91

> [C++] Build AWS C++ SDK for Windows in conda-forge
> --
>
> Key: ARROW-4860
> URL: https://issues.apache.org/jira/browse/ARROW-4860
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> We need the aws-sdk-cpp package to be able to use the C++ SDK for S3 support. 
> It is currently available for Linux and macOS.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4848) [C++] Static libparquet not compiled with -DARROW_STATIC on Windows

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912796#comment-16912796
 ] 

Wes McKinney commented on ARROW-4848:
-

This sounds right to me. [~jeroenooms] do you have some workaround implemented 
for this?

> [C++] Static libparquet not compiled with -DARROW_STATIC on Windows
> ---
>
> Key: ARROW-4848
> URL: https://issues.apache.org/jira/browse/ARROW-4848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>
> When trying to link the R bindings against static libparquet.a + libarrow.a 
> we get a lot of missing arrow symbol warnings from libparquet.a. I think the 
> problem is that libparquet.a was not compiled with -DARROW_STATIC, and 
> therefore cannot be linked against libarrow.a.
> When arrow cmake is configured with  -DARROW_BUILD_SHARED=OFF I think it 
> should automatically use -DARROW_STATIC when compiling libparquet on Windows?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4848) [C++] Static libparquet not compiled with -DARROW_STATIC on Windows

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4848:

Summary: [C++] Static libparquet not compiled with -DARROW_STATIC on 
Windows  (was: Static libparquet not compiled with -DARROW_STATIC on Windows)

> [C++] Static libparquet not compiled with -DARROW_STATIC on Windows
> ---
>
> Key: ARROW-4848
> URL: https://issues.apache.org/jira/browse/ARROW-4848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>
> When trying to link the R bindings against static libparquet.a + libarrow.a 
> we get a lot of missing arrow symbol warnings from libparquet.a. I think the 
> problem is that libparquet.a was not compiled -DARROW_STATIC, and therefore 
> cannot be linked against libarrow.a.
> When arrow cmake is configured with  -DARROW_BUILD_SHARED=OFF I think it 
> should automatically use -DARROW_STATIC when compiling libparquet on Windows?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4836) "Cannot tell() a compressed stream" when using RecordBatchStreamWriter

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912794#comment-16912794
 ] 

Wes McKinney commented on ARROW-4836:
-

Indeed it seems like technically this should work. We would have to decide what 
the semantics are for Tell on a compressed stream (probably reporting the 
number of uncompressed bytes written).
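A hedged Python sketch of that choice of semantics (wrapper and attribute names are illustrative): tell() reports how many uncompressed bytes have passed through write().

{code:python}
class CountingCompressedSink:
    def __init__(self, compressed_sink):
        self.sink = compressed_sink   # e.g. a gzip-compressing stream
        self.uncompressed_written = 0

    def write(self, data):
        self.sink.write(data)
        self.uncompressed_written += len(data)

    def tell(self):
        # Report position in the uncompressed logical stream.
        return self.uncompressed_written
{code}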

> "Cannot tell() a compressed stream" when using RecordBatchStreamWriter
> --
>
> Key: ARROW-4836
> URL: https://issues.apache.org/jira/browse/ARROW-4836
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Mike Pedersen
>Priority: Major
>
> It does not seem like RecordBatchStreamWriter works with compressed streams:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.12.1'
> >>> stream = pa.output_stream('/tmp/a.gz')
> >>> batch = pa.RecordBatch.from_arrays([pa.array([1])], ['a'])
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write(batch)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/ipc.pxi", line 181, in pyarrow.lib._RecordBatchWriter.write
>   File "pyarrow/ipc.pxi", line 196, in 
> pyarrow.lib._RecordBatchWriter.write_batch
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Cannot tell() a compressed stream
> {code}
> As I understand the documentation, this should be possible, right?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4848) Static libparquet not compiled with -DARROW_STATIC on Windows

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4848:

Fix Version/s: 0.15.0

> Static libparquet not compiled with -DARROW_STATIC on Windows
> -
>
> Key: ARROW-4848
> URL: https://issues.apache.org/jira/browse/ARROW-4848
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Jeroen
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.15.0
>
>
> When trying to link the R bindings against static libparquet.a + libarrow.a 
> we get a lot of missing arrow symbol warnings from libparquet.a. I think the 
> problem is that libparquet.a was not compiled with -DARROW_STATIC, and 
> therefore cannot be linked against libarrow.a.
> When arrow cmake is configured with  -DARROW_BUILD_SHARED=OFF I think it 
> should automatically use -DARROW_STATIC when compiling libparquet on Windows?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3786) Enable merge_arrow_pr.py script to run in non-English JIRA accounts.

2019-08-21 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912795#comment-16912795
 ] 

Sean Owen commented on ARROW-3786:
--

Hm, I don't know much about the merge script, but of course the current state 
in Spark is at 
https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py . I don't 
recall hearing about this kind of issue in Spark before.

What happens if you change your language in ASF JIRA to English at 
https://issues.apache.org/jira/secure/ViewProfile.jspa ?
Not that this is a full solution, but if that works, it suggests the fix means 
changing something about how we call the JIRA API -- if that's even possible.

> Enable merge_arrow_pr.py script to run in non-English JIRA accounts.
> 
>
> Key: ARROW-3786
> URL: https://issues.apache.org/jira/browse/ARROW-3786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Yosuke Shiro
>Priority: Minor
>
> I read 
> [https://github.com/apache/arrow/tree/master/dev#arrow-developer-scripts]
>  
> I ran the following command.
> {code:java}
> dev/merge_arrow_pr.py{code}
> I got the following result.
> {code:java}
> Would you like to update the associated JIRA? (y/n): y
> Enter comma-separated fix version(s) [0.12.0]:
> === JIRA ARROW-3748 ===
> summary [GLib] Add GArrowCSVReader
> assigneeKouhei Sutou
> status  オープン
> url https://issues.apache.org/jira/browse/ARROW-3748
>  
> list index out of range{code}
>  
> It looks like an error at 
> [https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py#L181].
> My JIRA account language is Japanese.
> This script does not seem to work if the account language is not English.
> {code:java}
> print(self.jira_con.transitions(self.jira_id))
> [{'id': '701', 'name': '課題のクローズ', 'to': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/status/6';, 'description': '課題の検 
> 討が終了し、解決方法が正しいことを表します。クローズした課題は再オープンすることができます。', 'iconUrl': 
> 'https://issues.apache.org/jira/images/icons/statuses/closed.png';, 'name': 
> 'クローズ', 'id': '6', 'statusCategory': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/statuscategory/3';, 'id': 3, 
> 'key': 'done', 'colorName': 'green', 'name': '完了'}}}, {'id': '3', 'name': 
> '課題を再オープンする', 'to': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/status/4';, 'description': 
> '課題が一度解決されたが解決に間違いがあったと見なされ たことを表します。ここから課題を割り当て済みにするか解決済みに設定できます。', 
> 'iconUrl': 
> 'https://issues.apache.org/jira/images/icons/statuses/reopened.png';, 'name': 
> '再オープン', 'id': '4', 'statusCategory': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/statuscategory/2';, 'id': 2, 
> 'key': 'new', 'colorName': 'blue-gray', 'name': 'To Do'}}}]{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4836) "Cannot tell() a compressed stream" when using RecordBatchStreamWriter

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4836:

Fix Version/s: 0.15.0

> "Cannot tell() a compressed stream" when using RecordBatchStreamWriter
> --
>
> Key: ARROW-4836
> URL: https://issues.apache.org/jira/browse/ARROW-4836
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Mike Pedersen
>Priority: Major
> Fix For: 0.15.0
>
>
> It does not seem like RecordBatchStreamWriter works with compressed streams:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '0.12.1'
> >>> stream = pa.output_stream('/tmp/a.gz')
> >>> batch = pa.RecordBatch.from_arrays([pa.array([1])], ['a'])
> >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> >>> writer.write(batch)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/ipc.pxi", line 181, in pyarrow.lib._RecordBatchWriter.write
>   File "pyarrow/ipc.pxi", line 196, in 
> pyarrow.lib._RecordBatchWriter.write_batch
>   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Cannot tell() a compressed stream
> {code}
> As I understand the documentation, this should be possible, right?
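
A workaround in the meantime (a sketch, assuming the stream fits in memory) is 
to write the IPC stream to a seekable in-memory sink, then compress the 
finished bytes:

{code:python}
import gzip
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1])], ['a'])

sink = pa.BufferOutputStream()   # seekable, so tell() works
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# Compress the completed stream in one pass
with gzip.open('/tmp/a.gz', 'wb') as f:
    f.write(sink.getvalue().to_pybytes())
{code}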



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4809) [Python] import error with undefined symbol _ZNK5arrow6Status8ToStringB5cxx11Ev

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4809.
---
Resolution: Cannot Reproduce

Please reopen if you have a reproduction for us to look at

> [Python] import error with undefined symbol 
> _ZNK5arrow6Status8ToStringB5cxx11Ev
> ---
>
> Key: ARROW-4809
> URL: https://issues.apache.org/jira/browse/ARROW-4809
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
> Environment: RHELS 6.10; Python 3.7.2
>Reporter: David Schwab
>Priority: Major
>
> I installed conda 4.5.12 and created a new environment named test-env. I 
> activated this environment and installed several packages with conda, 
> including pyarrow. When I run a Python shell and import pyarrow, I get the 
> following error:
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/test-env/lib/python3.7/site-packages/pyarrow/__init__.py", line 54, 
> in <module>
>     from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: 
> /test-env/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so:
>  undefined symbol: _ZNK5arrow6Status8ToStringB5cxx11Ev
> {code}
> From Googling, I believe this has to do with the compiler flags used to build 
> either pyarrow or one of its dependencies (libboost has been suggested); I 
> can build the package from source if I need to, but I'm not sure what flags I 
> would need to set to fix the error.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4779) [CI] AppVeyor link failure

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4779.
---
Resolution: Cannot Reproduce

> [CI] AppVeyor link failure
> --
>
> Key: ARROW-4779
> URL: https://issues.apache.org/jira/browse/ARROW-4779
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Francois Saint-Jacques
>Priority: Minor
>  Labels: ci-failure
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/22841788/job/i0bmixvlw67ty284#L671
> {code:java}
>   Version 14.00.24241.7
>   ExceptionCode= C005
>   ExceptionFlags   = 
>   ExceptionAddress = 7FF78516AE57 (7FF78513) 
> "C:\PROGRA~2\MI0E91~1.0\VC\bin\amd64\link.exe"
>   NumberParameters = 0002
>   ExceptionInformation[ 0] = 
>   ExceptionInformation[ 1] = 000201EF7BF0
> CONTEXT:
>   Rax= 0011  R8 = 
>   Rbx= 00CE87C812A0  R9 = 7FF78522EA30
>   Rcx= 7FF78522EA30  R10= 
>   Rdx= 0011  R11= 00CE8834F0C0
>   Rsp= 00CE8834DC00  R12= 
>   Rbp= 00CE8834DD00  R13= 
>   Rsi=   R14= 0100
>   Rdi= 000201EF7BF0  R15= 0001
>   Rip= 7FF78516AE57  EFlags = 00010202
>   SegCs  = 0033  SegDs  = 002B
>   SegSs  = 002B  SegEs  = 002B
>   SegFs  = 0053  SegGs  = 002B
>   Dr0=   Dr3= 
>   Dr1=   Dr6= 
>   Dr2=   Dr7= 
> LINK : fatal error LNK1000: unknown error at 7FF78516AE1A; consult 
> documentation for technical support options
> [189/282] Building CXX object 
> src\arrow\CMakeFiles\arrow-scalar-test.dir\scalar-test.cc.obj
> [190/282] Building CXX object 
> src\arrow\CMakeFiles\arrow-public-api-test.dir\public-api-test.cc.obj
> ninja: build stopped: subcommand failed.
> C:\projects\arrow\cpp\build-debug>goto scriptexit 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4770) [C++][ORC] Enable copy free conversion for primitive types

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4770:

Summary: [C++][ORC] Enable copy free conversion for primitive types  (was: 
Enable copy free conversion for primitive types)

> [C++][ORC] Enable copy free conversion for primitive types
> --
>
> Key: ARROW-4770
> URL: https://issues.apache.org/jira/browse/ARROW-4770
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Yurui Zhou
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4771) [C++][ORC] Enable copy free conversion for Composite type

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4771:

Summary: [C++][ORC] Enable copy free conversion for Composite type  (was: 
Enable copy free conversion for Composite type)

> [C++][ORC] Enable copy free conversion for Composite type
> -
>
> Key: ARROW-4771
> URL: https://issues.apache.org/jira/browse/ARROW-4771
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Yurui Zhou
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4757) [C++] Nested chunked array support

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4757.
---
Resolution: Won't Fix

Now that we have the Large* types this seems less needed. 

> [C++] Nested chunked array support
> --
>
> Key: ARROW-4757
> URL: https://issues.apache.org/jira/browse/ARROW-4757
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Philipp Moritz
>Priority: Major
>
> Dear all,
> I'm currently trying to lift the 2GB limit on the python serialization. For 
> this, I implemented a chunked union builder to split the array into smaller 
> arrays.
> However, some of the children of the union array can be ListArrays, which can 
> themselves contain UnionArrays which can contain ListArrays etc. I'm at a bit 
> of a loss how to handle this. In principle I'd like to chunk the children 
> too. However, currently UnionArrays can only have children of type Array, and 
> there is no way to treat a chunked array (which is a vector of Arrays) as an 
> Array to store it as a child of a UnionArray. Any ideas how to best support 
> this use case?
> -- Philipp.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4746) [C++/Python] PyDateTime_Date wrongly casted to PyDateTime_DateTime

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4746:

Fix Version/s: 0.15.0

> [C++/Python] PyDateTime_Date wrongly casted to PyDateTime_DateTime
> --
>
> Key: ARROW-4746
> URL: https://issues.apache.org/jira/browse/ARROW-4746
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: pypy
> Fix For: 0.15.0
>
>
> As mentioned in 
> https://bitbucket.org/pypy/pypy/issues/2842/running-pyarrow-on-pypy-segfaults#comment-50670536,
>  we currently access a {{PyDateTime_Date}} object with a 
> {{PyDateTime_DateTime}} cast in {{PyDateTime_DATE_GET_SECOND}} in our code in 
> two instances. While CPython is able to deal with this wrong usage, PyPy is 
> not able to do so. We should separate the path here into one that deals with 
> dates and another that deals with datetimes.
> Reproducible code:
> {code:java}
> import datetime
> import pyarrow as pa
> 
> pa.array([datetime.date(2018, 5, 10)], type=pa.date64()){code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4726) [C++] IntToFloatingPoint tests disabled under 32bit builds

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4726:

Fix Version/s: 1.0.0

> [C++] IntToFloatingPoint tests disabled under 32bit builds
> --
>
> Key: ARROW-4726
> URL: https://issues.apache.org/jira/browse/ARROW-4726
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Javier Luraschi
>Priority: Major
> Fix For: 1.0.0
>
>
> Follow-up needed for 
> [arrow/pull/3693/files|https://github.com/apache/arrow/pull/3693/files].
> Under cpp/src/arrow/compute/kernels/cast-test.cc, the 
> TestCast/IntToFloatingPoint test was disabled by adding:
> {code:java}
> #if ARROW_BITNESS >= 64
> #endif{code}
> This should be reverted and investigated further.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4470) [Python] Pyarrow using considerable more memory when reading partitioned Parquet file

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4470.
---
Resolution: Cannot Reproduce

If you can provide a reproduction of the issue we can provide more advice on 
how to reduce memory use (or determine whether there are bugs). Note that 
pyarrow 0.14.1 has a memory use bug, ARROW-6060, that is fixed in master

> [Python] Pyarrow using considerable more memory when reading partitioned 
> Parquet file
> -
>
> Key: ARROW-4470
> URL: https://issues.apache.org/jira/browse/ARROW-4470
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: Ivan SPM
>Priority: Major
>  Labels: dataset, datasets, parquet
> Fix For: 1.0.0
>
>
> Hi,
> I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
> with the following structure:
> {{/data/myparquettable/year=2016}}{{/data/myparquettable/year=2016/myfile_1.prt}}
> {{/data/myparquettable/year=2016/myfile_2.prt}}
> {{/data/myparquettable/year=2016/myfile_3.prt}}
> {{/data/myparquettable/year=2017}}
> {{/data/myparquettable/year=2017/myfile_1.prt}}
> {{/data/myparquettable/year=2017/myfile_2.prt}}
> {{/data/myparquettable/year=2017/myfile_3.prt}}
> and so on. I need to work with one partition, so I copied one partition to a 
> local filesystem:
> {{hdfs fs -get /data/myparquettable/year=2017 /local/}}
> so now I have some data on the local disk:
> {{/local/year=2017/myfile_1.prt }}{{/local/year=2017/myfile_2.prt }}
> etc. I tried to read it using Pyarrow:
> {{import pyarrow.parquet as pq}}{{pq.read_table('/local/year=2017')}}
> and it starts reading. The problem is that the local Parquet files are around 
> 15GB total, and I blew up my machine memory a couple of times because when 
> reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure 
> how much it will take because it never finishes. Is this expected? Is there a 
> workaround?
>  
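
Pending a fix, one way to bound memory (a sketch, assuming the downstream work 
can proceed per row group) is to stream the partition instead of reading it in 
one call:

{code:python}
import glob
import pyarrow.parquet as pq

for path in sorted(glob.glob('/local/year=2017/*.prt')):
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        table = pf.read_row_group(i)
        # process `table` here; only one row group is resident at a time
{code}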



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4427) Move Confluence Wiki pages to the Sphinx docs

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912779#comment-16912779
 ] 

Wes McKinney commented on ARROW-4427:
-

Seems like we've made a little progress here, but there's more to do

> Move Confluence Wiki pages to the Sphinx docs
> -
>
> Key: ARROW-4427
> URL: https://issues.apache.org/jira/browse/ARROW-4427
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> It's hard to find and modify the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  and other developers' wiki pages in Confluence. If these were moved to 
> inside the project web page, that would make it easier.
> There are 5 steps to this:
>  # Create a new directory inside of `arrow/docs/source` to house the wiki 
> pages. (It will look like the 
> [cpp|https://github.com/apache/arrow/tree/master/docs/source/cpp] or 
> [python|https://github.com/apache/arrow/tree/master/docs/source/python] 
> directories.)
>  # Copy the wiki page contents to new `*.rst` pages inside this new directory.
>  # Add an `index.rst` that links to them all with enough description to help 
> navigation.
>  # Modify the Sphinx index page 
> [`arrow/docs/source/index.rst`|https://github.com/apache/arrow/blob/master/docs/source/index.rst]
>  to have an entry that points to the new index page made in step 3
>  # Modify the static site page 
> [`arrow/site/_includes/header.html`|https://github.com/apache/arrow/blob/8e195327149b670de2cd7a8cfe75bbd6f71c6b49/site/_includes/header.html#L33]
>  to point to the newly created page instead of the wiki page.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4439) [C++] Improve FindBrotli.cmake

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4439.
---
Resolution: Invalid

Please reopen with a description of the issue or create a new PR

> [C++] Improve FindBrotli.cmake
> --
>
> Key: ARROW-4439
> URL: https://issues.apache.org/jira/browse/ARROW-4439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Renat Valiullin
>Assignee: Renat Valiullin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4359) [Python] Column metadata is not saved or loaded in parquet

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912778#comment-16912778
 ] 

Wes McKinney commented on ARROW-4359:
-

This looks kinda buggy, maybe it's fixed now. I added to 0.15.0 so we can see

> [Python] Column metadata is not saved or loaded in parquet
> --
>
> Key: ARROW-4359
> URL: https://issues.apache.org/jira/browse/ARROW-4359
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Seb Fru
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> Hi all,
> a while ago I posted this issue: ARROW-3866
> While working with Pyarrow I encountered another potential bug related to 
> column metadata: If I create a table containing columns with metadata 
> everything is fine. But after I save the table to parquet and load it back as 
> a table using pq.read_table, the column metadata is gone.
>  
> As of now I can not say yet whether the metadata is not saved correctly or 
> not loaded correctly, as I have no idea how to verify it. Unfortunately I 
> also don't have the time try a lot, but I wanted to let you know anyway. 
>  
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> path = 'test.parquet'
> field0 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B"))
> field1 = pa.field('field2', pa.int64(), nullable=False)
> columns = [
>     pa.column(field0, pa.array([1, 2])),
>     pa.column(field1, pa.array([3, 4]))
> ]
> table = pa.Table.from_arrays(columns)
> pq.write_table(table, path)
> tab2 = pq.read_table(path)
> tab2.column(0).field.metadata
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4359) [Python] Column metadata is not saved or loaded in parquet

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4359:

Fix Version/s: 0.15.0

> [Python] Column metadata is not saved or loaded in parquet
> --
>
> Key: ARROW-4359
> URL: https://issues.apache.org/jira/browse/ARROW-4359
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Seb Fru
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> Hi all,
> a while ago I posted this issue: ARROW-3866
> While working with Pyarrow I encountered another potential bug related to 
> column metadata: If I create a table containing columns with metadata 
> everything is fine. But after I save the table to parquet and load it back as 
> a table using pq.read_table, the column metadata is gone.
>  
> As of now I can not say yet whether the metadata is not saved correctly or 
> not loaded correctly, as I have no idea how to verify it. Unfortunately I 
> also don't have the time try a lot, but I wanted to let you know anyway. 
>  
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> path = 'test.parquet'
> field0 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B"))
> field1 = pa.field('field2', pa.int64(), nullable=False)
> columns = [
>     pa.column(field0, pa.array([1, 2])),
>     pa.column(field1, pa.array([3, 4]))
> ]
> table = pa.Table.from_arrays(columns)
> pq.write_table(table, path)
> tab2 = pq.read_table(path)
> tab2.column(0).field.metadata
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4279) [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912777#comment-16912777
 ] 

Wes McKinney commented on ARROW-4279:
-

I think this task should be abandoned

> [C++] Rebase https://github.com/apache/parquet-cpp/pull/462# onto arrow repo
> 
>
> Key: ARROW-4279
> URL: https://issues.apache.org/jira/browse/ARROW-4279
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> The old commit needs to be changed to be a PR against the arrow repo and not 
> parquet-cpp.
> Changes needed as part of this:
> 1.  Allow for running both the old and new code paths until the performance 
> regression can be eliminated.
> 2.  Instead of passing through nthreads, consider using util/task-group from 
> arrow as a parameter.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4220) [Python] Add buffered input and output stream ASV benchmarks with simulated high latency IO

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4220:

Fix Version/s: 0.15.0

> [Python] Add buffered input and output stream ASV benchmarks with simulated 
> high latency IO
> ---
>
> Key: ARROW-4220
> URL: https://issues.apache.org/jira/browse/ARROW-4220
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: benchmark
> Fix For: 0.15.0
>
>
> Follow up to ARROW-3126



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-4120) [Python] Define process for testing procedures that check for no macro-level memory leaks

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912776#comment-16912776
 ] 

Wes McKinney commented on ARROW-4120:
-

In the context of ARROW-6060, we had a case with runaway peak memory use. 
Otherwise there was no memory leak nor issue with Python references etc.

I think we should implement tests that check that peak memory use (at least 
according to what a {{MemoryPool}} is able to account for) does not exceed a 
given level. Then we could write better tests
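
A minimal sketch of such a check (assuming the workload allocates through the 
default pool; the helper name is illustrative):

{code:python}
import pyarrow as pa

def assert_bounded_memory(workload, iterations=100, limit_bytes=100 << 20):
    # Only sees allocations tracked by pyarrow's MemoryPool, and samples
    # after each iteration rather than measuring the true peak.
    baseline = pa.total_allocated_bytes()
    for _ in range(iterations):
        workload()
        used = pa.total_allocated_bytes() - baseline
        if used > limit_bytes:
            raise AssertionError('tracked memory %d exceeds limit %d bytes'
                                 % (used, limit_bytes))
{code}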

> [Python] Define process for testing procedures that check for no macro-level 
> memory leaks
> -
>
> Key: ARROW-4120
> URL: https://issues.apache.org/jira/browse/ARROW-4120
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Some kinds of memory leaks may be difficult to unit test for, and they may 
> not cause valgrind errors necessarily
> I had written some ad hoc leak tests in 
> https://github.com/apache/arrow/blob/master/python/scripts/test_leak.py. We 
> have some more of this in ARROW-3324. 
> It would be useful to be able to create a sort of "test suite" of memory leak 
> checks. They are a bit too intensive to run in CI (since you may have to run 
> something many iterations to see whether it leaks), but we could run them in 
> a nightly build



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-4111) [Python] Create time types from Python sequences of integers

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4111:
---

Assignee: Wes McKinney

> [Python] Create time types from Python sequences of integers
> 
>
> Key: ARROW-4111
> URL: https://issues.apache.org/jira/browse/ARROW-4111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0
>
>
> This works for dates, but not times:
> {code}
> > traceback
> def test_to_pandas_deduplicate_date_time():
> nunique = 100
> repeats = 10
> 
> unique_values = list(range(nunique))
> 
> cases = [
> # array type, to_pandas options
> ('date32', {'date_as_object': True}),
> ('date64', {'date_as_object': True}),
> ('time32[ms]', {}),
> ('time64[us]', {})
> ]
> 
> for array_type, pandas_options in cases:
> >   arr = pa.array(unique_values * repeats, type=array_type)
> pyarrow/tests/test_convert_pandas.py:2392: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> pyarrow/array.pxi:175: in pyarrow.lib.array
> return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
> pyarrow/array.pxi:36: in pyarrow.lib._sequence_to_array
> check_status(ConvertPySequence(sequence, mask, options, &out))
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   raise ArrowInvalid(message)
> E   pyarrow.lib.ArrowInvalid: ../src/arrow/python/python_to_arrow.cc:1012 : 
> ../src/arrow/python/iterators.h:70 : Could not convert 0 with type int: 
> converting to time32
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4111) [Python] Create time types from Python sequences of integers

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4111:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [Python] Create time types from Python sequences of integers
> 
>
> Key: ARROW-4111
> URL: https://issues.apache.org/jira/browse/ARROW-4111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> This works for dates, but not times:
> {code}
> > traceback
> def test_to_pandas_deduplicate_date_time():
> nunique = 100
> repeats = 10
> 
> unique_values = list(range(nunique))
> 
> cases = [
> # array type, to_pandas options
> ('date32', {'date_as_object': True}),
> ('date64', {'date_as_object': True}),
> ('time32[ms]', {}),
> ('time64[us]', {})
> ]
> 
> for array_type, pandas_options in cases:
> >   arr = pa.array(unique_values * repeats, type=array_type)
> pyarrow/tests/test_convert_pandas.py:2392: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> pyarrow/array.pxi:175: in pyarrow.lib.array
> return _sequence_to_array(obj, mask, size, type, pool, from_pandas)
> pyarrow/array.pxi:36: in pyarrow.lib._sequence_to_array
> check_status(ConvertPySequence(sequence, mask, options, &out))
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >   raise ArrowInvalid(message)
> E   pyarrow.lib.ArrowInvalid: ../src/arrow/python/python_to_arrow.cc:1012 : 
> ../src/arrow/python/iterators.h:70 : Could not convert 0 with type int: 
> converting to time32
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4095) [C++] Implement optimizations for dictionary unification where dictionaries are prefixes of the unified dictionary

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4095:

Fix Version/s: (was: 1.0.0)
   0.15.0

> [C++] Implement optimizations for dictionary unification where dictionaries 
> are prefixes of the unified dictionary
> --
>
> Key: ARROW-4095
> URL: https://issues.apache.org/jira/browse/ARROW-4095
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> In the event that the unified dictionary contains other dictionaries as 
> prefixes (e.g. as the result of delta dictionaries), we can avoid memory 
> allocation and index transposition.
> See discussion at 
> https://github.com/apache/arrow/pull/3165#discussion_r243020982
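
To illustrate the prefix case with plain Python lists (illustrative only, not 
the C++ API):

{code:python}
dict_a = ['apple', 'banana']             # dictionary of one input array
unified = ['apple', 'banana', 'cherry']  # unified dictionary

# dict_a is a prefix of `unified`, so existing indices stay valid and no
# transpose buffer needs to be allocated for that input.
indices_a = [0, 1, 1, 0]
assert [unified[i] for i in indices_a] == [dict_a[i] for i in indices_a]
{code}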



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-4083) [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense Array (of the dictionary type)

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-4083.
---
Resolution: Won't Fix

I will take care of this elsewhere when it is actually needed

> [C++] Allowing ChunkedArrays to contain a mix of DictionaryArray and dense 
> Array (of the dictionary type)
> -
>
> Key: ARROW-4083
> URL: https://issues.apache.org/jira/browse/ARROW-4083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
>
> In some applications we may receive a stream of some dictionary encoded data 
> followed by some non-dictionary encoded data. For example this happens in 
> Parquet files when the dictionary reaches a certain configurable size 
> threshold.
> We should think about how we can model this in our in-memory data structures, 
> and how it can flow through to relevant computational components (i.e. 
> certain data flow observers -- like an Aggregation -- might need to be able 
> to process either a dense or dictionary encoded version of a particular array 
> in the same stream)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912773#comment-16912773
 ] 

Wes McKinney commented on ARROW-3933:
-

Added to 0.15.0 milestone so I can take a quick look to assess whether this is 
a bug in Parquet C++

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3933) [Python] Segfault reading Parquet files from GNOMAD

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3933:

Fix Version/s: 0.15.0

> [Python] Segfault reading Parquet files from GNOMAD
> ---
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
>Reporter: David Konerding
>Priority: Minor
>  Labels: parquet
> Fix For: 0.15.0
>
>
> I am getting a segfault trying to run a basic program on an Ubuntu 18.04 VM 
> (AWS). The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp 
> gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>  .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-0-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x7fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**, 
> unsigned long*) () from 
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I tested fastparquet, it reads the file just fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3919) [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912771#comment-16912771
 ] 

Wes McKinney commented on ARROW-3919:
-

Now that we have Large* types this can be implemented more cleanly

> [Python] Support 64 bit indices for pyarrow.serialize and pyarrow.deserialize
> -
>
> Key: ARROW-3919
> URL: https://issues.apache.org/jira/browse/ARROW-3919
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> see https://github.com/modin-project/modin/issues/266



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3786) Enable merge_arrow_pr.py script to run in non-English JIRA accounts.

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912769#comment-16912769
 ] 

Wes McKinney commented on ARROW-3786:
-

[~srowen] do you have any advice from Apache Spark since you use a similar PR 
merge script? Is it possible to override the user's locale when making REST 
calls?

> Enable merge_arrow_pr.py script to run in non-English JIRA accounts.
> 
>
> Key: ARROW-3786
> URL: https://issues.apache.org/jira/browse/ARROW-3786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Yosuke Shiro
>Priority: Minor
>
> I read 
> [https://github.com/apache/arrow/tree/master/dev#arrow-developer-scripts]
>  
> I ran the following command.
> {code:java}
> dev/merge_arrow_pr.py{code}
> I got the following result.
> {code:java}
> Would you like to update the associated JIRA? (y/n): y
> Enter comma-separated fix version(s) [0.12.0]:
> === JIRA ARROW-3748 ===
> summary [GLib] Add GArrowCSVReader
> assigneeKouhei Sutou
> status  オープン
> url https://issues.apache.org/jira/browse/ARROW-3748
>  
> list index out of range{code}
>  
> It looks like an error on 
> [https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py#L181] .
> My JIRA account language is Japanese.
> This script does not seem to work if the account language is not English.
> {code:java}
> print(self.jira_con.transitions(self.jira_id))
> [{'id': '701', 'name': '課題のクローズ', 'to': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/status/6';, 'description': '課題の検 
> 討が終了し、解決方法が正しいことを表します。クローズした課題は再オープンすることができます。', 'iconUrl': 
> 'https://issues.apache.org/jira/images/icons/statuses/closed.png';, 'name': 
> 'クローズ', 'id': '6', 'statusCategory': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/statuscategory/3';, 'id': 3, 
> 'key': 'done', 'colorName': 'green', 'name': '完了'}}}, {'id': '3', 'name': 
> '課題を再オープンする', 'to': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/status/4';, 'description': 
> '課題が一度解決されたが解決に間違いがあったと見なされ たことを表します。ここから課題を割り当て済みにするか解決済みに設定できます。', 
> 'iconUrl': 
> 'https://issues.apache.org/jira/images/icons/statuses/reopened.png';, 'name': 
> '再オープン', 'id': '4', 'statusCategory': {'self': 
> 'https://issues.apache.org/jira/rest/api/2/statuscategory/2';, 'id': 2, 
> 'key': 'new', 'colorName': 'blue-gray', 'name': 'To Do'}}}]{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3777) [Python] Implement a mock "high latency" filesystem

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3777:

Fix Version/s: 0.15.0

> [Python] Implement a mock "high latency" filesystem
> ---
>
> Key: ARROW-3777
> URL: https://issues.apache.org/jira/browse/ARROW-3777
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> Some of our tools don't perform well out of the box for filesystems with high 
> latency reads, like cloud blob stores. In such cases, it may be better to use 
> buffered reads with a larger read ahead window. Having a mock filesystem to 
> introduce latency into reads will help with testing / developing APIs for this
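
A rough sketch of the idea (the class and the wrapped-filesystem interface are 
hypothetical, not an existing pyarrow API):

{code:python}
import time

class HighLatencyFilesystem:
    """Delegates to another filesystem, sleeping before each open()."""

    def __init__(self, wrapped_fs, latency_seconds=0.1):
        self._fs = wrapped_fs
        self._latency = latency_seconds

    def open(self, path, mode='rb'):
        time.sleep(self._latency)       # simulate slow time-to-first-byte
        return self._fs.open(path, mode)

    def __getattr__(self, name):
        return getattr(self._fs, name)  # everything else passes through
{code}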



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3764) [C++] Port Python "ParquetDataset" business logic to C++

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3764:

Labels: dataset parquet  (was: dataset datasets parquet)

> [C++] Port Python "ParquetDataset" business logic to C++
> 
>
> Key: ARROW-3764
> URL: https://issues.apache.org/jira/browse/ARROW-3764
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 1.0.0
>
>
> Along with defining appropriate abstractions for dealing with generic 
> filesystems in C++, we should implement the machinery for reading multiple 
> Parquet files in C++ so that it can reused in GLib, R, and Ruby. Otherwise 
> these languages will have to reimplement things, and this would surely result 
> in inconsistent features, bugs in some implementations but not others



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3705) [Python] Add "nrows" argument to parquet.read_table read indicated number of rows from file instead of whole file

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3705:

Labels: dataset parquet  (was: dataset datasets parquet)

> [Python] Add "nrows" argument to parquet.read_table read indicated number of 
> rows from file instead of whole file
> -
>
> Key: ARROW-3705
> URL: https://issues.apache.org/jira/browse/ARROW-3705
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> This mirrors the {{nrows}} argument in {{pandas.read_csv}}
> inspired by 
> https://stackoverflow.com/questions/53152671/how-to-read-sample-records-parquet-file-in-s3
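
Until such an argument exists, a row-group-granularity approximation is 
possible (a sketch; whole row groups are read before the final slice):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def read_table_head(path, nrows):
    pf = pq.ParquetFile(path)
    pieces, total = [], 0
    for i in range(pf.num_row_groups):
        piece = pf.read_row_group(i)
        pieces.append(piece)
        total += piece.num_rows
        if total >= nrows:
            break
    return pa.concat_tables(pieces).slice(0, nrows)
{code}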



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3263) [R] Use R sentinel values for missingness in addition to bitmask

2019-08-21 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3263:
---
Component/s: R

> [R] Use R sentinel values for missingness in addition to bitmask
> 
>
> Key: ARROW-3263
> URL: https://issues.apache.org/jira/browse/ARROW-3263
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format, R
>Reporter: Gabriel Becker
>Priority: Major
>
> R uses sentinel values to indicate missingness within atomic vectors (read 
> arrays in Arrow parlance, AFAIK). 
> Currently, according to [~wesmckinn], the value in the array in memory 
> is undefined if the bitmap indicating missingness is set to 1. 
> This will force R to copy and modify data whenever adopting Arrow data which 
> has missingness present as a native vector.
> If the value were written to the relevant sentinel values (INT_MIN for 32 bit 
> integers, and NaN with payload 1954 for double precision floats) _in addition 
> to_ the bit mask, then R would be able to use Arrow as intended while not 
> breaking any other systems.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6260) [Website] Use deploy key on Travis to build and push to asf-site

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6260:
--
Labels: pull-request-available  (was: )

> [Website] Use deploy key on Travis to build and push to asf-site
> 
>
> Key: ARROW-6260
> URL: https://issues.apache.org/jira/browse/ARROW-6260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
>
> ARROW-4473 added CI/CD for the website, but there was some discomfort about 
> having a committer provide a GitHub personal access token to do the pushing 
> of the built site to the asf-site branch. Investigate using GitHub Deploy 
> Keys instead, which are scoped to a single repository, not all public 
> repositories that a user has access to.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6011) [Python] Data incomplete when using pyarrow in pyspark in python 3.x

2019-08-21 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6011.
-
Resolution: Cannot Reproduce

I could not reproduce. We can continue the discussion in SPARK-28482 and reopen 
if we find an issue in Arrow

> [Python] Data incomplete when using pyarrow in pyspark in python 3.x
> 
>
> Key: ARROW-6011
> URL: https://issues.apache.org/jira/browse/ARROW-6011
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java, Python
>Affects Versions: 0.10.0, 0.14.0
> Environment: CentOS 7.4, pyarrow 0.10.0 / 0.14.0, python 2.7 / 3.5 / 
> 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: image-2019-07-23-16-06-49-889.png, py3.6.png, test.csv, 
> test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des 
> method. However, an issue arises with python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in python 3.x. At first I thought our processing logic might be 
> wrong, so I reduced the code to a very simple case, which showed the same 
> problem. After investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. Prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer:
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   Produce 100,000 rows and name the file test.csv, upload it to hdfs, then 
> load it and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2. Register the pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3. Apply the pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4. Execute it in pyspark (local or yarn)
>  Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before, the total row number is 100, so it should print "iterator 
> one time" 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !image-2019-07-23-16-06-49-889.png!  
>  The result is correct: it prints 10 times.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by Spark code we added; I 
> will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” message is added in worker.py.
>  !worker.png!   
>  In order to get the exception, change the Spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding this code to 
> print the exception:
>   
>  
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>  case t: Throwable =>
>  // Purposefully not using NonFatal, because even fatal exceptions
>  // we don't want to have our finallyBlock suppress
> + logInfo(t.getLocalizedMessage)
> + t.printStackTrace()
>  originalThrowable = t
>  throw originalThrowable
>  } finally {{code}
>  
>  
>  It seems pyspark gets the data from the JVM, but pyarrow receives the data 
> incomplete. The pyarrow side thinks the data is finished, then shuts down the 
> socket. At the same time, the JVM side still writes to the same socket, but 
> gets a socket-close exception.
>  The pyarrow part is in ipc.pxi:
>   
> {code:java}
> cdef class _RecordBatchReader:
>     cdef:
>         shared_ptr[CRecordBatchReader] reader
>         shared_ptr[InputStream] in_stream
> 
>     cdef readonly:
>         Schema schema
> 
>     def __cinit__(self):
>         pass
> 
>     def _open(self, source):
>         get_input_stream(source, &self.in_stream)
>         with nogil:
>             check_status(CRecordBatchStreamReader.Open(
>                 self.in_stream.get(), &self.reader))
>         self.schema = pyarrow_wrap_schema(self.reader.get().schema())
> 
>     def __iter__(self):
>         while True:
>             yield self.read_next_batch()
> 
>     def get_next_batch(self):
>         import warnings
>         warnings.warn('Please use read_next_batch instead of '
>  

[jira] [Closed] (ARROW-3685) [Python] Use fixed size binary for NumPy fixed-size string dtypes

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3685.
---
Resolution: Won't Fix

Per [~pitrou]'s comments, if an explicit data type is not passed, the safest 
thing to do by default is to use {{pyarrow.binary()}} to allow for values that 
have nul terminators (and so are smaller than the dtype indicates). You can 
pass a fixed-size type to {{pyarrow.array}} to override this

> [Python] Use fixed size binary for NumPy fixed-size string dtypes
> -
>
> Key: ARROW-3685
> URL: https://issues.apache.org/jira/browse/ARROW-3685
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Maarten Breddels
>Priority: Major
>
> I'm working on getting support for arrow in vaex (out of core dataframe 
> library for Python) in this PR:
> [https://github.com/maartenbreddels/vaex/pull/116]
> I found that fixed-length binary arrays from numpy (say dtype='S42') will be 
> converted to a non-fixed-length array. Trying to convert that back to numpy 
> will fail, since there is no such conversion.
> It makes more sense to convert dtype='S42' to an arrow array with 
> pyarrow.binary(42) type, as I do in:
> https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4
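
Given the resolution above, the opt-in looks roughly like this (a sketch, 
assuming {{pa.array}} accepts an explicit fixed-size binary type for 'S' 
input):

{code:python}
import numpy as np
import pyarrow as pa

arr = np.array([b'abc', b'def'], dtype='S3')

default = pa.array(arr)                   # variable-size binary (the default)
fixed = pa.array(arr, type=pa.binary(3))  # opt in to fixed_size_binary[3]
{code}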



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3604) [R] Support to collect int64 as ints

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3604:

Fix Version/s: 1.0.0

> [R] Support to collect int64 as ints
> 
>
> Key: ARROW-3604
> URL: https://issues.apache.org/jira/browse/ARROW-3604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-3599) [C++] "infer" reports errors

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3599.
---
Resolution: Won't Fix

> [C++] "infer" reports errors
> 
>
> Key: ARROW-3599
> URL: https://issues.apache.org/jira/browse/ARROW-3599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> I pasted the errors generated by the [infer|https://fbinfer.com/] tool here:
> https://gist.github.com/pitrou/8512eec5f6ee31c02a58de7bdb2c3f7f
> cc [~renesugar]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3651) [Python] Datetimes from non-DateTimeIndex cannot be deserialized

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3651:

Fix Version/s: 0.15.0

> [Python] Datetimes from non-DateTimeIndex cannot be deserialized
> 
>
> Key: ARROW-3651
> URL: https://issues.apache.org/jira/browse/ARROW-3651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> Given an index which contains datetimes but is not a DatetimeIndex, writing 
> the file works but reading it back fails.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> df = pd.DataFrame(1, index=pd.MultiIndex.from_arrays([[1,2],[3,4]]), 
> columns=[pd.to_datetime("2018/01/01")])
> # columns index is no longer a DatetimeIndex
> df = df.reset_index().set_index(['level_0', 'level_1'])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> pq.read_pandas('test.parquet').to_pandas()
> {code}
> results in 
> {code}
> KeyError  Traceback (most recent call last)
> ~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
> _pandas_type_to_numpy_type(pandas_type)
> 676 try:
> --> 677 return _pandas_logical_type_map[pandas_type]
> 678 except KeyError:
> KeyError: 'datetime'
> {code}
> The created schema:
> {code}
> 2018-01-01 00:00:00: int64
> level_0: int64
> level_1: int64
> metadata
> 
> {b'pandas': b'{"index_columns": ["level_0", "level_1"], "column_indexes": 
> [{"n'
> b'ame": null, "field_name": null, "pandas_type": "datetime", 
> "nump'
> b'y_type": "object", "metadata": null}], "columns": [{"name": 
> "201'
> b'8-01-01 00:00:00", "field_name": "2018-01-01 00:00:00", 
> "pandas_'
> b'type": "int64", "numpy_type": "int64", "metadata": null}, 
> {"name'
> b'": "level_0", "field_name": "level_0", "pandas_type": "int64", 
> "'
> b'numpy_type": "int64", "metadata": null}, {"name": "level_1", 
> "fi'
> b'eld_name": "level_1", "pandas_type": "int64", "numpy_type": 
> "int'
> b'64", "metadata": null}], "pandas_version": "0.23.4"}'}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3543) [R] Time zone adjustment issue when reading Feather file written by Python

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3543:

Priority: Major  (was: Critical)

> [R] Time zone adjustment issue when reading Feather file written by Python
> --
>
> Key: ARROW-3543
> URL: https://issues.apache.org/jira/browse/ARROW-3543
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Olaf
>Priority: Major
> Fix For: 1.0.0
>
>
> Hello the dream team,
> Pasting from [https://github.com/wesm/feather/issues/351]
> Thanks for this wonderful package. I was playing with feather and some 
> timestamps and I noticed some dangerous behavior. Maybe it is a bug.
> Consider this
>  
> {code:java}
> import pandas as pd
> import feather
> import numpy as np
> df = pd.DataFrame(
> {'string_time_utc' : [pd.to_datetime('2018-02-01 14:00:00.531'), 
> pd.to_datetime('2018-02-01 14:01:00.456'), pd.to_datetime('2018-03-05 
> 14:01:02.200')]}
> )
> df['timestamp_est'] = 
> pd.to_datetime(df.string_time_utc).dt.tz_localize('UTC').dt.tz_convert('US/Eastern').dt.tz_localize(None)
> df
>  Out[17]: 
>  string_time_utc timestamp_est
>  0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
>  1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
> {code}
> Here I create the corresponding `EST` timestamp of my original timestamps (in 
> `UTC` time).
> Now saving the dataframe to `csv` or to `feather` will generate two 
> completely different results.
>  
> {code:java}
> df.to_csv('P://testing.csv')
> df.to_feather('P://testing.feather')
> {code}
> Switching to R.
> Using the good old `csv` gives me something a bit annoying, but expected. R 
> thinks my timezone is `UTC` by default, and wrongly attached this timezone to 
> `timestamp_est`. No big deal, I can always use `with_tz` or even better: 
> import as character and process as timestamp while in R.
>  
> {code:java}
> > dataframe <- read_csv('P://testing.csv')
>  Parsed with column specification:
>  cols(
>  X1 = col_integer(),
>  string_time_utc = col_datetime(format = ""),
>  timestamp_est = col_datetime(format = "")
>  )
>  Warning message:
>  Missing column names filled in: 'X1' [1] 
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 4
>  X1 string_time_utc timestamp_est 
> <int> <dttm> <dttm>
>  1 0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530
>  2 1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
>  3 2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
>  mytimezone
>  <chr> 
>  1 UTC 
>  2 UTC 
>  3 UTC  {code}
> {code:java}
> #Now look at what happens with feather:
>  
>  > dataframe <- read_feather('P://testing.feather')
>  > 
>  > dataframe %>% mutate(mytimezone = tz(timestamp_est))
> A tibble: 3 x 3
>  string_time_utc timestamp_est mytimezone
> <dttm> <dttm> <chr>
>  1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 "" 
>  2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 "" 
>  3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 "" {code}
> My timestamps have been converted!!! pure insanity. 
>  Am I missing something here?
> Thanks!!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3590) [Python] Expose Python API for start and end offset of row group in parquet file

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3590:

Labels: dataset parquet  (was: parquet)

> [Python] Expose Python API for start and end offset of row group in parquet 
> file
> 
>
> Key: ARROW-3590
> URL: https://issues.apache.org/jira/browse/ARROW-3590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Heejong Lee
>Priority: Minor
>  Labels: dataset, parquet
>
> Is there a way to get more detailed metadata from Parquet file in Pyarrow? 
> Specifically, I want to access the start and end offset information about 
> each row group.
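
Per-column-chunk offsets are already exposed in Python, which gets close (a 
sketch; 'example.parquet' is a placeholder, and a row group's byte range must 
be derived from its column chunks):

{code:python}
import pyarrow.parquet as pq

md = pq.ParquetFile('example.parquet').metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)
    start = col.dictionary_page_offset or col.data_page_offset
    end = start + col.total_compressed_size  # end of this first column chunk
    print(rg, start, end)
{code}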



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3424:

Component/s: C++

> [Python] Improved workflow for loading an arbitrary collection of Parquet 
> files
> ---
>
> Key: ARROW-3424
> URL: https://issues.apache.org/jira/browse/ARROW-3424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 1.0.0
>
>
> See SO question for use case: 
> https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis
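
For reference, the workflow today for an arbitrary list of files is roughly 
the following (a sketch; paths are placeholders, and it assumes all files 
share one schema, which is part of what an improved workflow would need to 
handle):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

paths = ['data/part-0.parquet', 'data/part-1.parquet']
table = pa.concat_tables([pq.read_table(p) for p in paths])
{code}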



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3424:

Labels: dataset parquet  (was: dataset datasets parquet)

> [Python] Improved workflow for loading an arbitrary collection of Parquet 
> files
> ---
>
> Key: ARROW-3424
> URL: https://issues.apache.org/jira/browse/ARROW-3424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
> Fix For: 1.0.0
>
>
> See SO question for use case: 
> https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3410) [C++] Streaming CSV reader interface for memory-constrained environments

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3410:

Labels: dataset  (was: )

> [C++] Streaming CSV reader interface for memory-constrained environments
> -
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file 
> do not fit into memory, this can be a problem. I propose to define a 
> streaming {{RecordBatchReader}} interface so that the record batches produced 
> by reading can be written out immediately to a stream on disk, to be memory 
> mapped later
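> A rough sketch of how such an interface could be used, with {{open_csv}} as 
> a hypothetical stand-in for the proposed streaming reader:
> {code:python}
> # Sketch: stream record batches to an IPC file on disk, to be
> # memory-mapped later. `open_csv` is hypothetical here.
> import pyarrow as pa
> 
> def spill_batches(reader, path):
>     """Write batches from a RecordBatchReader to an Arrow IPC file."""
>     with pa.OSFile(path, "wb") as sink:
>         writer = pa.RecordBatchFileWriter(sink, reader.schema)
>         for batch in reader:
>             writer.write_batch(batch)
>         writer.close()
> 
> # Later, map the file back without loading it into memory:
> # with pa.memory_map(path) as source:
> #     table = pa.RecordBatchFileReader(source).read_all()
> {code}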



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3410:

Summary: [C++][Dataset] Streaming CSV reader interface for 
memory-constrained environments  (was: [C++] Streaming CSV reader interface 
for memory-constrained environments)

> [C++][Dataset] Streaming CSV reader interface for memory-constrained 
> environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file 
> do not fit into memory, this can be a problem. I propose to define a 
> streaming {{RecordBatchReader}} interface so that the record batches produced 
> by reading can be written out immediately to a stream on disk, to be memory 
> mapped later



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3263) [R] Use R sentinel values for missingness in addition to bitmask

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912749#comment-16912749
 ] 

Wes McKinney commented on ARROW-3263:
-

Circling back on this discussion from a year ago.

Now that we have {{arrow::ExtensionType}} in C++, technically we could 
introduce containers for R data that has not been serialized to one of the 
built-in Arrow types. I'm not sure what you would do with such data, but it is 
now possible to faithfully transport unmodified R data end to end through 
Arrow's IPC / RPC machinery.

> [R] Use R sentinel values for missingness in addition to bitmask
> 
>
> Key: ARROW-3263
> URL: https://issues.apache.org/jira/browse/ARROW-3263
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Gabriel Becker
>Priority: Major
>
> R uses sentinel values to indicate missingness within atomic vectors (read: 
> arrays in Arrow parlance, AFAIK). 
> Currently, according to [~wesmckinn], the value in the array in memory 
> is undefined if the bitmap indicating missingness is set to 1. 
> This will force R to copy and modify data whenever it adopts Arrow data that 
> has missingness present as a native vector.
> If null slots were written with the relevant sentinel values (INT_MIN for 32-bit 
> integers, and NaN with payload 1954 for double-precision floats) _in addition 
> to_ the bit mask, then R would be able to use Arrow as intended while not 
> breaking any other systems.
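> For illustration, a minimal Python sketch (not from the original proposal) 
> of an int32 array whose null slot carries R's INT_MIN sentinel in the 
> values buffer in addition to the validity bitmap:
> {code:python}
> # Sketch: hand-build the buffers so the null slot also holds the sentinel.
> import struct
> import pyarrow as pa
> 
> INT_MIN = -2**31  # R's NA_integer_ sentinel
> 
> values = struct.pack("<3i", 1, INT_MIN, 3)   # slot 1 carries the sentinel
> validity = bytes([0b00000101])               # bits: slots 0 and 2 are valid
> arr = pa.Array.from_buffers(
>     pa.int32(), 3, [pa.py_buffer(validity), pa.py_buffer(values)],
>     null_count=1)
> print(arr)  # [1, null, 3]; a reader honoring the bitmap sees a null,
>             # while an R reader could also recognize the sentinel directly.
> {code}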



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-3203) [C++] Build error on Debian Buster

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3203.
---
Resolution: Fixed

We're successfully building for Debian Buster now, so I think this is outdated.

cc [~kou]

> [C++] Build error on Debian Buster
> --
>
> Key: ARROW-3203
> URL: https://issues.apache.org/jira/browse/ARROW-3203
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.10.0
>Reporter: albertoramon
>Priority: Major
> Attachments: DockerfileRV, flatbuffers_ep-build-err.log
>
>
> There is an error on Debian Buster (Debian Stretch works fine).
> You can reproduce it easily by changing the first line of the attached Dockerfile.
>  
> *To reproduce it:*
> {code:java}
> docker build -f DockerfileRV -t arrow_rw .
> docker run -it arrow_rw bash
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3221) [C++][Python] Add a virtual Slice method to buffers

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3221:

Fix Version/s: 0.15.0

> [C++][Python] Add a virtual Slice method to buffers
> ---
>
> Key: ARROW-3221
> URL: https://issues.apache.org/jira/browse/ARROW-3221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Pearu Peterson
>Priority: Major
> Fix For: 0.15.0
>
>
> See
> https://github.com/apache/arrow/pull/2536#discussion_r216383211
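> For context, buffer slicing is already zero-copy at the Python level; a 
> minimal sketch of the semantics the virtual method would generalize:
> {code:python}
> # Sketch: slice a buffer without copying; the slice shares memory.
> import pyarrow as pa
> 
> buf = pa.py_buffer(b"abcdefgh")
> window = buf.slice(2, 4)
> print(window.to_pybytes())  # b'cdef'
> {code}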



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-3232) [Python] Return an ndarray from Column.to_pandas

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3232.
---
Resolution: Won't Fix

Column is no more

> [Python] Return an ndarray from Column.to_pandas
> 
>
> Key: ARROW-3232
> URL: https://issues.apache.org/jira/browse/ARROW-3232
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Major
>
> See discussion: 
> https://github.com/apache/arrow/pull/2535#discussion_r216299243
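> For context, a minimal sketch of the present-day equivalent, assuming a 
> recent pyarrow where {{table.column()}} yields a ChunkedArray:
> {code:python}
> # Sketch: ChunkedArray.to_pandas() returns a Series; call .to_numpy()
> # on it when an ndarray is wanted.
> import pyarrow as pa
> 
> table = pa.Table.from_pydict({"x": [1, 2, 3]})
> series = table.column(0).to_pandas()
> ndarray = series.to_numpy()
> {code}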



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3154) [Python][C++] Document how to write _metadata, _common_metadata files with Parquet datasets

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912746#comment-16912746
 ] 

Wes McKinney commented on ARROW-3154:
-

This issue may be subsumed by the broader migration to a C++-based 
implementation

cc [~bkietz] [~fsaintjacques]

> [Python][C++] Document how to write _metadata, _common_metadata files with 
> Parquet datasets
> ---
>
> Key: ARROW-3154
> URL: https://issues.apache.org/jira/browse/ARROW-3154
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> This is not mentioned in great detail in 
> http://arrow.apache.org/docs/python/parquet.html
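> A minimal sketch of what such documentation could cover, using 
> {{pyarrow.parquet.write_metadata}} (paths hypothetical; a full _metadata 
> file with row-group statistics additionally requires collecting per-file 
> metadata at write time):
> {code:python}
> # Sketch: write a schema-only _common_metadata sidecar next to a dataset.
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> table = pa.Table.from_pydict({"x": [1, 2, 3]})
> pq.write_to_dataset(table, root_path="dataset_root")
> 
> # Schema-only sidecar that tools such as Dask and Spark look for:
> pq.write_metadata(table.schema, "dataset_root/_common_metadata")
> {code}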



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3154) [Python] Document how to write _metadata, _common_metadata files with Parquet datasets

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3154:

Labels: dataset parquet  (was: parquet)

> [Python] Document how to write _metadata, _common_metadata files with Parquet 
> datasets
> --
>
> Key: ARROW-3154
> URL: https://issues.apache.org/jira/browse/ARROW-3154
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> This is not mentioned in great detail in 
> http://arrow.apache.org/docs/python/parquet.html



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3154) [Python][C++] Document how to write _metadata, _common_metadata files with Parquet datasets

2019-08-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3154:

Summary: [Python][C++] Document how to write _metadata, _common_metadata 
files with Parquet datasets  (was: [Python] Document how to write _metadata, 
_common_metadata files with Parquet datasets)

> [Python][C++] Document how to write _metadata, _common_metadata files with 
> Parquet datasets
> ---
>
> Key: ARROW-3154
> URL: https://issues.apache.org/jira/browse/ARROW-3154
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset, parquet
>
> This is not mentioned in great detail in 
> http://arrow.apache.org/docs/python/parquet.html



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6309) [C++] Parquet tests and executables are linked statically

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6309:
--
Labels: pull-request-available  (was: )

> [C++] Parquet tests and executables are linked statically
> -
>
> Key: ARROW-6309
> URL: https://issues.apache.org/jira/browse/ARROW-6309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> For some reason, on Linux Parquet tests are now statically linked with 
> {{libparquet}} and {{libarrow}} by default, even though other tests (Arrow, 
> Plasma...) are dynamically-linked.
> For example:
> {code}
> $ ldd build-test/debug/parquet-schema-test 
>   linux-vdso.so.1 (0x7ffd376ad000)
>   libgtest_main.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
> (0x7f3affeaf000)
>   libgtest.so => /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so 
> (0x7f3affde5000)
>   libbz2.so.1.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbz2.so.1.0 (0x7f3affdd1000)
>   liblz4.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/liblz4.so.1 
> (0x7f3aff58d000)
>   libsnappy.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libsnappy.so.1 (0x7f3aff384000)
>   libz.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libz.so.1 
> (0x7f3affdb1000)
>   libzstd.so.1.3.7 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libzstd.so.1.3.7 
> (0x7f3affd0a000)
>   libboost_filesystem.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 
> (0x7f3aff168000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f3afeff4000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f3afec56000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f3affcbb000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f3afe865000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f3afe646000)
>   libboost_system.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libboost_system.so.1.67.0 
> (0x7f3afe441000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f3afe239000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f3affc8f000)
> {code}
> Contrast with e.g.:
> {code}
> $ ldd build-test/debug/arrow-uri-test 
>   linux-vdso.so.1 (0x7ffe07fb6000)
>   libarrow.so.15 => 
> /home/antoine/arrow/dev/cpp/build-test/debug/libarrow.so.15 
> (0x7f774f34)
>   libboost_system.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.67.0 
> (0x7f774f13b000)
>   libgtest_main.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
> (0x7f7751723000)
>   libgtest.so => /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so 
> (0x7f7751659000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f774efc7000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f7751645000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f774ebd6000)
>   libaws-cpp-sdk-s3.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so 
> (0x7f774e99)
>   libaws-cpp-sdk-core.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so 
> (0x7f774e893000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f774e68f000)
>   liburiparser.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/liburiparser.so.1 
> (0x7f77515f2000)
>   libbz2.so.1.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbz2.so.1.0 (0x7f77515de000)
>   liblz4.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/liblz4.so.1 
> (0x7f774e46b000)
>   libsnappy.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libsnappy.so.1 (0x7f774e262000)
>   libz.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libz.so.1 
> (0x7f77515bc000)
>   libzstd.so.1.3.7 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libzstd.so.1.3.7 
> (0x7f774e1bd000)
>   libboost_filesystem.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 
> (0x7f774dfa1000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f774dd82000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f774d9e4000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f7751503000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f774d7dc000)
>   libcurl.so.4 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libcurl.so.4 (0x7f7751534000)
>   

[jira] [Commented] (ARROW-6309) [C++] Parquet tests and executables are linked statically

2019-08-21 Thread Sutou Kouhei (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912724#comment-16912724
 ] 

Sutou Kouhei commented on ARROW-6309:
-

We need to use static linking only on Windows, right?
I've created a pull request for the change: 
https://github.com/apache/arrow/pull/5158

> [C++] Parquet tests and executables are linked statically
> -
>
> Key: ARROW-6309
> URL: https://issues.apache.org/jira/browse/ARROW-6309
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For some reason, on Linux Parquet tests are now statically linked with 
> {{libparquet}} and {{libarrow}} by default, even though other tests (Arrow, 
> Plasma...) are dynamically-linked.
> For example:
> {code}
> $ ldd build-test/debug/parquet-schema-test 
>   linux-vdso.so.1 (0x7ffd376ad000)
>   libgtest_main.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
> (0x7f3affeaf000)
>   libgtest.so => /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so 
> (0x7f3affde5000)
>   libbz2.so.1.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbz2.so.1.0 (0x7f3affdd1000)
>   liblz4.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/liblz4.so.1 
> (0x7f3aff58d000)
>   libsnappy.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libsnappy.so.1 (0x7f3aff384000)
>   libz.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libz.so.1 
> (0x7f3affdb1000)
>   libzstd.so.1.3.7 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libzstd.so.1.3.7 
> (0x7f3affd0a000)
>   libboost_filesystem.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 
> (0x7f3aff168000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f3afeff4000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f3afec56000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f3affcbb000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f3afe865000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f3afe646000)
>   libboost_system.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/./libboost_system.so.1.67.0 
> (0x7f3afe441000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f3afe239000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f3affc8f000)
> {code}
> Contrast with e.g.:
> {code}
> $ ldd build-test/debug/arrow-uri-test 
>   linux-vdso.so.1 (0x7ffe07fb6000)
>   libarrow.so.15 => 
> /home/antoine/arrow/dev/cpp/build-test/debug/libarrow.so.15 
> (0x7f774f34)
>   libboost_system.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.67.0 
> (0x7f774f13b000)
>   libgtest_main.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
> (0x7f7751723000)
>   libgtest.so => /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so 
> (0x7f7751659000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f774efc7000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f7751645000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f774ebd6000)
>   libaws-cpp-sdk-s3.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so 
> (0x7f774e99)
>   libaws-cpp-sdk-core.so => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so 
> (0x7f774e893000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f774e68f000)
>   liburiparser.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/liburiparser.so.1 
> (0x7f77515f2000)
>   libbz2.so.1.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbz2.so.1.0 (0x7f77515de000)
>   liblz4.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/liblz4.so.1 
> (0x7f774e46b000)
>   libsnappy.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libsnappy.so.1 (0x7f774e262000)
>   libz.so.1 => /home/antoine/miniconda3/envs/pyarrow/lib/libz.so.1 
> (0x7f77515bc000)
>   libzstd.so.1.3.7 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libzstd.so.1.3.7 
> (0x7f774e1bd000)
>   libboost_filesystem.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 
> (0x7f774dfa1000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f774dd82000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f774d9e4000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f7751503000)
>  

[jira] [Assigned] (ARROW-6312) Declare required Libs.private in arrow.pc package config

2019-08-21 Thread Sutou Kouhei (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sutou Kouhei reassigned ARROW-6312:
---

Assignee: Michael Maguire

> Declare required Libs.private in arrow.pc package config
> 
>
> Key: ARROW-6312
> URL: https://issues.apache.org/jira/browse/ARROW-6312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.1
>Reporter: Michael Maguire
>Assignee: Michael Maguire
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The arrow.pc package config file currently produced is deficient: it doesn't 
> properly declare the static library prerequisites that must be linked in 
> order to *statically* link libarrow.a.
> Currently it just has:
> {code}
>  Libs: -L${libdir} -larrow
> {code}
> But in cases where, e.g., snappy, brotli or zlib support is enabled in arrow, 
> our toolchains need to see an arrow.pc file more like:
> {code}
>  Libs: -L${libdir} -larrow
>  Libs.private: -lsnappy -lboost_system -lz -llz4 -lbrotlidec -lbrotlienc 
> -lbrotlicommon -lzstd
> {code}
> If not, we get linkage errors.  I'm told the convention is that if the .a has 
> an UNDEF, Requires.private plus Libs.private should resolve all the undefs. 
> See the Libs.private info in [https://linux.die.net/man/1/pkg-config].
>  
> Note, however, that as Sutou Kouhei pointed out in 
> [https://github.com/apache/arrow/pull/5123#issuecomment-522771452], the 
> additional Libs.private entries need to be generated dynamically, based on 
> whether functionality like snappy, brotli or zlib is enabled.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

