[jira] [Comment Edited] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974875#comment-16974875
 ] 

Micah Kornfield edited comment on ARROW-6112 at 11/15/19 7:19 AM:
--

Based on discussion on the mailing list, there was a request for a rebase of the 
original PR that accommodated all vector APIs.  I haven't had, and probably won't 
have, time to do this.  It was mentioned that just redoing ArrowBuf to use a 
64-bit address space might be more palatable, so I'm going to focus on making 
that happen.

 

This will make it possible to support LargeBinary and LargeString (with the 
limitation that string length will still probably be practically limited to 2GB 
for most APIs).

 

However, for LargeList, child arrays will still be limited to 2 billion entries, 
so this would be of limited utility.
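
As a rough illustration, a minimal sketch assuming the post-change Java API, in 
which ArrowBuf capacities and indices become {{long}} values (the exact package 
location of ArrowBuf varies by version):

{code:java}
import org.apache.arrow.memory.ArrowBuf;  // io.netty.buffer.ArrowBuf in older versions
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

// Sketch: with 64-bit addressing, capacities and indices are longs,
// no longer capped at Integer.MAX_VALUE.
try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
  ArrowBuf buf = allocator.buffer(1024);
  long capacity = buf.capacity();  // long-valued capacity
  buf.setByte(capacity - 1, 1);    // long-valued index
  buf.close();
}
{code}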


was (Author: emkornfi...@gmail.com):
Based on discussion on the mailing list, there was a request for a rebase of the 
original PR that accommodated all vector APIs.  I haven't had, and probably won't 
have, time to do this.  It was mentioned that just redoing ArrowBuf to use a 
64-bit address space might be more palatable, so I'm going to focus on making 
that happen.

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-11-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6112:
---
Description: 
The arrow spec allows a 64-bit address range for buffers (and arrays); we 
should support this at the API level in Java even if the current Netty backing 
buffers don't support it.

 

See comment below.  This work item will focus on allowing 64-bit addressing in 
buffers.

  was:The arrow spec allows a 64-bit address range for buffers (and arrays); 
we should support this at the API level in Java even if the current Netty 
backing buffers don't support it.


> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.
>  
> See comment below.  This work item will focus on allowing 64-bit addressing 
> in buffers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6112) [Java] Update APIs to support 64-bit address space

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974875#comment-16974875
 ] 

Micah Kornfield commented on ARROW-6112:


Based on discussion on the mailing list, there was a request for a rebase of the 
original PR that accommodated all vector APIs.  I haven't had, and probably won't 
have, time to do this.  It was mentioned that just redoing ArrowBuf to use a 
64-bit address space might be more palatable, so I'm going to focus on making 
that happen.

> [Java] Update APIs to support 64-bit address space
> --
>
> Key: ARROW-6112
> URL: https://issues.apache.org/jira/browse/ARROW-6112
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The arrow spec allows a 64-bit address range for buffers (and arrays); we 
> should support this at the API level in Java even if the current Netty 
> backing buffers don't support it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6887) [Java] Create prose documentation for using ValueVectors

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974862#comment-16974862
 ] 

Micah Kornfield commented on ARROW-6887:


I believe we mostly only update the documentation after a release.

> [Java] Create prose documentation for using ValueVectors
> 
>
> Key: ARROW-6887
> URL: https://issues.apache.org/jira/browse/ARROW-6887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> We should create documentation (in restructured text) for the library that 
> demonstrates:
> 1.  Basic construction of ValueVectors.  Highlighting:
>     * ValueVector lifecycle
>     * Reading by rows using Readers (mentioning that it is not as efficient 
> as direct access).
>     * Populating with Writers
> 2.  Reading and writing IPC stream format and file formats.
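
For illustration, the lifecycle portion might look like the following minimal 
sketch (an assumption about what the docs could cover, using the standard Arrow 
Java APIs; not taken from the eventual documentation):

{code:java}
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

// ValueVector lifecycle: allocate, populate, set the value count, read, release.
try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     IntVector vector = new IntVector("example", allocator)) {
  vector.allocateNew(4);            // allocate space for 4 values
  vector.set(0, 10);
  vector.set(1, 20);
  vector.setNull(2);                // third slot is null
  vector.set(3, 40);
  vector.setValueCount(4);          // finalize the populated slot count
  int first = vector.get(0);        // direct access
  boolean isNull = vector.isNull(2);
}                                   // try-with-resources releases the buffers
{code}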



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-1175) [Java] Implement/test dictionary-encoded subfields

2019-11-14 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu resolved ARROW-1175.
---
Resolution: Fixed

> [Java] Implement/test dictionary-encoded subfields
> --
>
> Key: ARROW-1175
> URL: https://issues.apache.org/jira/browse/ARROW-1175
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Wes McKinney
>Assignee: Ji Liu
>Priority: Major
> Fix For: 1.0.0
>
>
> We do not have any tests about types like:
> {code}
> List
> {code}
> cc [~julienledem] [~elahrvivaz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6600) [Java] Implement dictionary-encoded subfields for Union type

2019-11-14 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu closed ARROW-6600.
-
Resolution: Later

> [Java] Implement dictionary-encoded subfields for Union type
> 
>
> Key: ARROW-6600
> URL: https://issues.apache.org/jira/browse/ARROW-6600
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Implement dictionary-encoded subfields for {{Union}} type. Each child vector 
> could be encodable or not.
>  
> Meanwhile, extract common logic into {{DictionaryEncoder}}, and refactor 
> List subfield encoding to keep it consistent with the {{Struct/Union}} types.
>  
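
For context, per-vector encoding goes through {{DictionaryEncoder}}; a minimal 
sketch (assuming {{values}} and {{dictionaryValues}} are already-populated 
{{VarCharVector}} instances):

{code:java}
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;

// VarCharVector values = ...;            // assumed populated
// VarCharVector dictionaryValues = ...;  // assumed populated

// Sketch: encode one (child) vector against a dictionary.
Dictionary dictionary = new Dictionary(
    dictionaryValues,
    new DictionaryEncoding(/*id=*/1L, /*ordered=*/false,
        new ArrowType.Int(32, /*isSigned=*/true)));
try (ValueVector indices = DictionaryEncoder.encode(values, dictionary)) {
  // "indices" holds integer ids into the dictionary; for a Union, each
  // encodable child vector would be replaced by such an index vector.
}
{code}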



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory

2019-11-14 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu closed ARROW-6019.
-
Resolution: Won't Do

> [Java] Port Jdbc and Avro adapter to new directory 
> ---
>
> Key: ARROW-6019
> URL: https://issues.apache.org/jira/browse/ARROW-6019
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>
> As discussed on the mailing list, adapters are different from native readers.
> This issue tracks the following items:
> i. Create a new "contrib" directory and move the Jdbc/Avro adapters to it.
> ii. Provide more description.
> iii. Change the ORC reader structure to a "converter".
> cc [~emkornfi...@gmail.com]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7175) [Website] Add a security page to track when vulnerabilities are patched

2019-11-14 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7175:
--

 Summary: [Website] Add a security page to track when 
vulnerabilities are patched
 Key: ARROW-7175
 URL: https://issues.apache.org/jira/browse/ARROW-7175
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Micah Kornfield


We might also want to give a brief tutorial on safely using the C++ library 
(e.g., pointers to validation methods).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6887) [Java] Create prose documentation for using ValueVectors

2019-11-14 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974842#comment-16974842
 ] 

Ji Liu commented on ARROW-6887:
---

It seems the web content ([http://arrow.apache.org/docs/]) has not been updated 
yet; should we do something to make these docs work? [~wesm] [~emkornfi...@gmail.com]

 

> [Java] Create prose documentation for using ValueVectors
> 
>
> Key: ARROW-6887
> URL: https://issues.apache.org/jira/browse/ARROW-6887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> We should create documentation (in restructured text) for the library that 
> demonstrates:
> 1.  Basic construction of ValueVectors.  Highlighting:
>     * ValueVector lifecycle
>     * Reading by rows using Readers (mentioning that it is not as efficient 
> as direct access).
>     * Populating with Writers
> 2.  Reading and writing IPC stream format and file formats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-7150.

Resolution: Not A Problem

> [Python] Explain parquet file size growth
> -
>
> Key: ARROW-7150
> URL: https://issues.apache.org/jira/browse/ARROW-7150
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Mac OS X
>Reporter: Bogdan Klichuk
>Priority: Major
> Attachments: 820.parquet
>
>
> Having columnar storage format in mind, with gzip compression enabled, I 
> can't make sense of how parquet file size is growing in my specific example.
> So far I'm not sharing a dataset (I would need to create a mock one to share).
> {code:python}
> > # 1. read 820 rows from a parquet file
> > df = pandas.read_parquet('820.parquet')
> > # size of 820.parquet is 528K
> > len(df)
> 820
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 33M. Why is it 60 times bigger?
>  {code}
>   
> Compression works better on bigger files. How did a 10x increase with 
> repeated data result in 60x growth of the file? Insane imo.
>  
> I'm working on a periodic job that concats smaller files into bigger ones, 
> and I'm now doubting whether I need it.
>  
> I attached 820.parquet to try out



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974827#comment-16974827
 ] 

Micah Kornfield commented on ARROW-7150:


[~klichukb] I don't think either Parquet or Avro is going to give you great 
performance if you are simply storing JSON strings as data.  If you want 
smaller file sizes, two columns in Parquet, one with the field name and one 
with the field value (boolean), would do a lot better.  Unfortunately, the 
Arrow library doesn't support writing nested data yet (otherwise a list would 
be preferred).

 

I opened https://issues.apache.org/jira/browse/ARROW-7174 which, if 
implemented, might still get useful compression when there is a lot of 
duplication in strings.  For now I think this is working as intended.
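
As a rough sketch of that two-column layout in Arrow Java (illustrative only; 
the column names are made up, and writing the vectors out to Parquet is not 
shown):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.VarCharVector;

// Sketch: one column for the field name, one for the boolean value,
// instead of storing whole JSON strings.
try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     VarCharVector names = new VarCharVector("field_name", allocator);
     BitVector values = new BitVector("field_value", allocator)) {
  names.allocateNew();
  values.allocateNew(2);
  names.setSafe(0, "active".getBytes(StandardCharsets.UTF_8));
  values.set(0, 1);   // true
  names.setSafe(1, "verified".getBytes(StandardCharsets.UTF_8));
  values.set(1, 0);   // false
  names.setValueCount(2);
  values.setValueCount(2);
}
{code}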

> [Python] Explain parquet file size growth
> -
>
> Key: ARROW-7150
> URL: https://issues.apache.org/jira/browse/ARROW-7150
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Affects Versions: 0.15.1
> Environment: Mac OS X
>Reporter: Bogdan Klichuk
>Priority: Major
> Attachments: 820.parquet
>
>
> Having columnar storage format in mind, with gzip compression enabled, I 
> can't make sense of how parquet file size is growing in my specific example.
> So far I'm not sharing a dataset (I would need to create a mock one to share).
> {code:python}
> > # 1. read 820 rows from a parquet file
> > df = pandas.read_parquet('820.parquet')
> > # size of 820.parquet is 528K
> > len(df)
> 820
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 33M. Why is it 60 times bigger?
>  {code}
>   
> Compression works better on bigger files. How did a 10x increase with 
> repeated data result in 60x growth of the file? Insane imo.
>  
> I'm working on a periodic job that concats smaller files into bigger ones, 
> and I'm now doubting whether I need it.
>  
> I attached 820.parquet to try out



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7174) [Python] Expose dictionary size parameter in python.

2019-11-14 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7174:
--

 Summary: [Python] Expose dictionary size parameter in python.
 Key: ARROW-7174
 URL: https://issues.apache.org/jira/browse/ARROW-7174
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Micah Kornfield


In some cases it might be useful to have dictionaries larger then the current 
default 1MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7174) [Python] Expose parquet dictionary size write parameter in python.

2019-11-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-7174:
---
Summary: [Python] Expose parquet dictionary size write parameter in python. 
 (was: [Python] Expose dictionary size parameter in python.)

> [Python] Expose parquet dictionary size write parameter in python.
> --
>
> Key: ARROW-7174
> URL: https://issues.apache.org/jira/browse/ARROW-7174
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Micah Kornfield
>Priority: Major
>
> In some cases it might be useful to have dictionaries larger then the current 
> default 1MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6820.
-
Resolution: Fixed

Issue resolved by pull request 5821
[https://github.com/apache/arrow/pull/5821]

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-6820:
---

Assignee: Bryan Cutler

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7173) Add test to verify Map field names can be arbitrary

2019-11-14 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7173:
---

 Summary: Add test to verify Map field names can be arbitrary
 Key: ARROW-7173
 URL: https://issues.apache.org/jira/browse/ARROW-7173
 Project: Apache Arrow
  Issue Type: Test
  Components: Integration
Reporter: Bryan Cutler


A Map has child fields, and the format spec only recommends that they be named 
"entries", "key", and "value"; they could be named anything. Currently, 
integration tests for Map arrays verify the exchanged schema is equal, so the 
child fields are always named the same. There should be tests that use 
different names to verify that implementations can accept this.
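
As an illustrative sketch in Arrow Java, building a Map field with non-default 
child names (the names here are hypothetical; the {{Field}}/{{FieldType}} calls 
are standard APIs, but this is not taken from the integration tests):

{code:java}
import java.util.Arrays;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

// Sketch: "my_entries"/"my_key"/"my_value" instead of the recommended
// "entries"/"key"/"value" child names.
Field key = new Field("my_key",
    FieldType.notNullable(new ArrowType.Utf8()), null);
Field value = new Field("my_value",
    FieldType.nullable(new ArrowType.Int(32, true)), null);
Field entries = new Field("my_entries",
    FieldType.notNullable(new ArrowType.Struct()), Arrays.asList(key, value));
Field map = new Field("m",
    FieldType.nullable(new ArrowType.Map(/*keysSorted=*/false)),
    Arrays.asList(entries));
{code}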



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6361) [Java] sbt docker publish fails due to Arrow dependecies

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974823#comment-16974823
 ] 

Micah Kornfield commented on ARROW-6361:


[~tampler] I think in the short term updating the documentation for this would 
be helpful (would you like to open a PR?).  I'm not an expert on SBT or Maven, 
but is there some way to potentially fix this and avoid the workaround?

> [Java] sbt docker publish fails due to Arrow dependecies
> 
>
> Key: ARROW-6361
> URL: https://issues.apache.org/jira/browse/ARROW-6361
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Boris V.Kuznetsov
>Priority: Major
> Attachments: tree.txt
>
>
> Hello guys
> I'm using Arrow in my Scala project and included Maven deps in sbt as 
> required.
> However, when I try to publish a Docker container with sbt 'docker:publish', 
> I get the following error:
> [error] 1 error was encountered during merge
> [error] java.lang.RuntimeException: deduplicate: different file contents 
> found in the following:
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-vector/0.14.1/arrow-vector-0.14.1.jar:git.properties
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-format/0.14.1/arrow-format-0.14.1.jar:git.properties
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-memory/0.14.1/arrow-memory-0.14.1.jar:git.properties
> My project is [here|https://github.com/Clover-Group/tsp/tree/kafka].
> You may check the attached project dependency tree.
> How do I fix that?
> Thank you



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1099) [C++] Add support for PFOR integer compression

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974818#comment-16974818
 ] 

Micah Kornfield commented on ARROW-1099:


I would rather discuss this as a possible follow-up to the compression proposal 
I have out, so let's close this for now.

> [C++] Add support for PFOR integer compression
> --
>
> Key: ARROW-1099
> URL: https://issues.apache.org/jira/browse/ARROW-1099
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> https://github.com/lemire/FastPFor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-1099) [C++] Add support for PFOR integer compression

2019-11-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-1099.

Resolution: Later

> [C++] Add support for PFOR integer compression
> --
>
> Key: ARROW-1099
> URL: https://issues.apache.org/jira/browse/ARROW-1099
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> https://github.com/lemire/FastPFor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6818) [Doc] Format docs confusing

2019-11-14 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-6818:
--

Assignee: Micah Kornfield

> [Doc] Format docs confusing
> ---
>
> Key: ARROW-6818
> URL: https://issues.apache.org/jira/browse/ARROW-6818
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Format
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> I find there are several issues in the format docs.
> 1) there is a claimed distinction between "logical types" and "physical 
> types", but the "physical types" actually lists logical types such as Map
> 2) the "logical types" document doesn't actually list logical types, it just 
> sends to the flatbuffers file. One shouldn't have to read a flatbuffers file 
> to understand the Arrow format.
> 3) some terminology seems unusual, such as "relative type"
> 4) why is there a link to the Apache Drill docs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6818) [Doc] Format docs confusing

2019-11-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974817#comment-16974817
 ] 

Micah Kornfield commented on ARROW-6818:


I'll see if I can address these.

> [Doc] Format docs confusing
> ---
>
> Key: ARROW-6818
> URL: https://issues.apache.org/jira/browse/ARROW-6818
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Format
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> I find there are several issues in the format docs.
> 1) there is a claimed distinction between "logical types" and "physical 
> types", but the "physical types" actually lists logical types such as Map
> 2) the "logical types" document doesn't actually list logical types, it just 
> sends to the flatbuffers file. One shouldn't have to read a flatbuffers file 
> to understand the Arrow format.
> 3) some terminology seems unusual, such as "relative type"
> 4) why is there a link to the Apache Drill docs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6930) [Java] Create utility class for populating vector values used for test purpose only

2019-11-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6930.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5693
[https://github.com/apache/arrow/pull/5693]

> [Java] Create utility class for populating vector values used for test 
> purpose only
> ---
>
> Key: ARROW-6930
> URL: https://issues.apache.org/jira/browse/ARROW-6930
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> There is a lot of verbosity in the construction of Arrays for testing 
> purposes (multiple lines of setSafe(...) or set(...)).
> We should start adding a utility class to make test setup clearer and more 
> concise. Note this class should be located in the arrow-vector test package 
> and could be used in other modules' tests by adding the dependency:
> {{<dependency>}}
> {{  <groupId>org.apache.arrow</groupId>}}
> {{  <artifactId>arrow-vector</artifactId>}}
> {{  <version>${project.version}</version>}}
> {{  <classifier>tests</classifier>}}
> {{  <type>test-jar</type>}}
> {{  <scope>test</scope>}}
> {{</dependency>}}
> Usage would be something like:
> {quote}try (IntVector vector = new IntVector("vector", allocator)) {
> ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5);
> output = doSomethingWith(input);
> assertThat(output).isEqualTo(expected);
> }
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6930) [Java] Create utility class for populating vector values used for test purpose only

2019-11-14 Thread Ji Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Liu updated ARROW-6930:
--
Description: 
There is a lot of verbosity in the construction of Arrays for testing purposes 
(multiple lines of setSafe(...) or set(...)).
We should start adding a utility class to make test setup clearer and more 
concise. Note this class should be located in the arrow-vector test package and 
could be used in other modules' tests by adding the dependency:

{{<dependency>}}
{{  <groupId>org.apache.arrow</groupId>}}
{{  <artifactId>arrow-vector</artifactId>}}
{{  <version>${project.version}</version>}}
{{  <classifier>tests</classifier>}}
{{  <type>test-jar</type>}}
{{  <scope>test</scope>}}
{{</dependency>}}

Usage would be something like:
{quote}try (IntVector vector = new IntVector("vector", allocator)) {
ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5);
output = doSomethingWith(input);
assertThat(output).isEqualTo(expected);
}
{quote}

  was:
There is a lot of verbosity in the construction of Arrays for testing purposes 
(multiple lines of setSafe(...) or set(...)).  We should start adding some 
static factory methods to make test setup clearer and more concise.  A 
strawman proposal for BigIntVector might look like:

 

static BigIntVector create(String name, BufferAllocator allocator, Long... 
values).

 

Usage would be something like:

 

try (BigIntVector input = BigIntVector.create("sample_data", allocator, 1235L, 
null, 456L);

      BigIntVector expected = BigIntVector.create("sample_data", allocator, 1L, 
null, 0L)) {

   output = doSomethingWith(input);

   assertThat(output).isEqualTo(expected);

}


> [Java] Create utility class for populating vector values used for test 
> purpose only
> ---
>
> Key: ARROW-6930
> URL: https://issues.apache.org/jira/browse/ARROW-6930
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> There is a lot of verbosity in the construction of Arrays for testing 
> purposes (multiple lines of setSafe(...) or set(...)).
> We should start adding a utility class to make test setup clearer and more 
> concise. Note this class should be located in the arrow-vector test package 
> and could be used in other modules' tests by adding the dependency:
> {{<dependency>}}
> {{  <groupId>org.apache.arrow</groupId>}}
> {{  <artifactId>arrow-vector</artifactId>}}
> {{  <version>${project.version}</version>}}
> {{  <classifier>tests</classifier>}}
> {{  <type>test-jar</type>}}
> {{  <scope>test</scope>}}
> {{</dependency>}}
> Usage would be something like:
> {quote}try (IntVector vector = new IntVector("vector", allocator)) {
> ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5);
> output = doSomethingWith(input);
> assertThat(output).isEqualTo(expected);
> }
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6997) [Packaging] Add support for RHEL

2019-11-14 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-6997.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5823
[https://github.com/apache/arrow/pull/5823]

> [Packaging] Add support for RHEL
> 
>
> Key: ARROW-6997
> URL: https://issues.apache.org/jira/browse/ARROW-6997
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We need symbolic links to {{${VERSION}Server}} from {{${VERSION}}} such as 
> {{7Server}} from {{7}}. (Is it available on BinTray?)
> We also need to update install information. We can't install {{epel-release}} 
> by {{yum install -y epel-release}}. We need to specify URL explicitly: {{yum 
> install 
> https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See 
> https://fedoraproject.org/wiki/EPEL for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7157) [R] Add validation, helpful error message to Object$new()

2019-11-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-7157.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5839
[https://github.com/apache/arrow/pull/5839]

> [R] Add validation, helpful error message to Object$new()
> -
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I have a 30 gig arrow file - using record batch reader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7170) [C++] Bundled ORC fails linking

2019-11-14 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-7170:
---

Assignee: Antoine Pitrou

> [C++] Bundled ORC fails linking
> ---
>
> Key: ARROW-7170
> URL: https://issues.apache.org/jira/browse/ARROW-7170
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This shows up when building the tests as well:
> {code}
> [1/2] Linking CXX executable debug/orc-adapter-test
> FAILED: debug/orc-adapter-test 
> : && /usr/bin/ccache /usr/bin/clang++-7  -Qunused-arguments 
> -fcolor-diagnostics -fuse-ld=gold -ggdb -O0  -Wall -Wextra -Wdocumentation
>  -Wno-unused-parameter -Wno-unknown-warning-option -Werror 
> -Wno-unknown-warning-option -msse4.2 -maltivec  -D_GLIBCXX_USE_CXX11_ABI=1 
> -D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g  -rdynamic 
> src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o  -o 
> debug/orc-adapter-test  
> -Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib
>  /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl 
> debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
> orc_ep-install/lib/liborc.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libssl.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm 
> -lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a 
> mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread 
> -lrt -Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && :
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284:
>  error: undefined reference to 'deflateInit2_'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232:
>  error: undefined reference to 'deflateReset'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254:
>  error: undefined reference to 'deflate'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291:
>  error: undefined reference to 'deflateEnd'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405:
>  error: undefined reference to 'inflateInit2_'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430:
>  error: undefined reference to 'inflateEnd'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471:
>  error: undefined reference to 'inflateReset'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477:
>  error: undefined reference to 'inflate'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820:
>  error: undefined reference to 'snappy::GetUncompressedLength(char const*, 
> unsigned long, unsigned long*)'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828:
>  error: undefined reference to 'snappy::RawUncompress(char const*, unsigned 
> long, char*)'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894:
>  error: undefined reference to 'LZ4_decompress_safe'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7158) [C++][Visual Studio]Build config Error on non English Version visual studio.

2019-11-14 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974624#comment-16974624
 ] 

Kouhei Sutou commented on ARROW-7158:
-

We should use {{CMAKE_CXX_COMPILER_ID}} and {{CMAKE_CXX_COMPILER_VERSION}} 
instead of parsing the version string from the compiler's command-line output.

* 
https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_ID.html#variable:CMAKE_%3CLANG%3E_COMPILER_ID
* https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_VERSION.html

> [C++][Visual Studio]Build config Error on non English Version visual studio.
> 
>
> Key: ARROW-7158
> URL: https://issues.apache.org/jira/browse/ARROW-7158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Yiun Seungryong
>Assignee: Kouhei Sutou
>Priority: Minor
>
> * Build Config Error on Non English OS
>  * always show 
> {code:java}
>  Not supported MSVC compiler  {code}
>  * 
> [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44]
> There is a bug in the code below.
> {code:java}
> if(MSVC)
>  set(COMPILER_FAMILY "msvc")
>  if("${COMPILER_VERSION_FULL}" MATCHES
>  ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code}
>  * In my compiler the version display contains Korean.
> {code:java}
> Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code}
>  * The regular expression seems to need to be changed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7158) [C++][Visual Studio]Build config Error on non English Version visual studio.

2019-11-14 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-7158:
---

Assignee: Kouhei Sutou

> [C++][Visual Studio]Build config Error on non English Version visual studio.
> 
>
> Key: ARROW-7158
> URL: https://issues.apache.org/jira/browse/ARROW-7158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Yiun Seungryong
>Assignee: Kouhei Sutou
>Priority: Minor
>
> * Build Config Error on Non English OS
>  * always show 
> {code:java}
>  Not supported MSVC compiler  {code}
>  * 
> [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44]
> There is a bug in the code below.
> {code:java}
> if(MSVC)
>  set(COMPILER_FAMILY "msvc")
>  if("${COMPILER_VERSION_FULL}" MATCHES
>  ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code}
>  * In my compiler the version display contains Korean.
> {code:java}
> Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code}
>  * The regular expression seems to need to be changed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7172:
--
Labels: pull-request-available  (was: )

> [C++][Dataset] Improve format of Expression::ToString
> -
>
> Key: ARROW-7172
> URL: https://issues.apache.org/jira/browse/ARROW-7172
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read 
> {{"b"_ > int32(3)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7157) [R] Add validation, helpful error message to Object$new()

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7157:
--
Labels: pull-request-available  (was: )

> [R] Add validation, helpful error message to Object$new()
> -
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Assignee: Neal Richardson
>Priority: Blocker
>  Labels: pull-request-available
>
> I have a 30 gig arrow file - using record batch reader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio

2019-11-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reopened ARROW-7157:

  Assignee: Neal Richardson

> [R] RecordBatchFileReader - Crashes RStudio
> ---
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Assignee: Neal Richardson
>Priority: Blocker
>
> I have a 30 gig arrow file - using record batch reader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7157) [R] Add validation, helpful error message to Object$new()

2019-11-14 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7157:
---
Summary: [R] Add validation, helpful error message to Object$new()  (was: 
[R] RecordBatchFileReader - Crashes RStudio)

> [R] Add validation, helpful error message to Object$new()
> -
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Assignee: Neal Richardson
>Priority: Blocker
>
> I have a 30 gig arrow file - using record batch reader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-851) C++/Python: Check Boost/Arrow C++ABI for consistency

2019-11-14 Thread Uwe Korn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Korn resolved ARROW-851.

Resolution: Fixed

conda-forge moved to the new CXX ABI and thus this is no longer relevant.

> C++/Python: Check Boost/Arrow C++ABI for consistency
> 
>
> Key: ARROW-851
> URL: https://issues.apache.org/jira/browse/ARROW-851
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Uwe Korn
>Priority: Major
>
> When building with dependencies from conda-forge on a newer system with GCC, 
> the C++ ABI versions can differ. We need to ensure that the versions match 
> between Boost, arrow-cpp and pyarrow in our CMake scripts.
> Depending on this, we may need to pass {{-D_GLIBCXX_USE_CXX11_ABI=0}} to 
> {{CMAKE_CXX_FLAGS}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6965) [C++][Dataset] Optionally expose partition keys as materialized columns

2019-11-14 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-6965:
---

Assignee: Ben Kietzman

> [C++][Dataset] Optionally expose partition keys as materialized columns
> ---
>
> Key: ARROW-6965
> URL: https://issues.apache.org/jira/browse/ARROW-6965
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset
>
> This would be exposed in the DataSourceDiscovery as an option.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7163) [Doc] Fix double-and typos

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-7163:
-

Assignee: Brian Wignall

> [Doc] Fix double-and typos
> --
>
> Key: ARROW-7163
> URL: https://issues.apache.org/jira/browse/ARROW-7163
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Neal Richardson
>Assignee: Brian Wignall
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7069) [C++][Dataset] Replace ConstantPartitionScheme with PrefixDictionaryPartitionScheme

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7069:
--
Labels: pull-request-available  (was: )

> [C++][Dataset] Replace ConstantPartitionScheme with 
> PrefixDictionaryPartitionScheme
> ---
>
> Key: ARROW-7069
> URL: https://issues.apache.org/jira/browse/ARROW-7069
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
>
> ConstantPartitionScheme is not very useful; it'd be better to provide a 
> dictionary of prefixes that maps to provided partition expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6636) [C++] Do not build C++ command line utilities by default

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6636.
---
Resolution: Fixed

Issue resolved by pull request 5830
[https://github.com/apache/arrow/pull/5830]

> [C++] Do not build C++ command line utilities by default
> 
>
> Key: ARROW-6636
> URL: https://issues.apache.org/jira/browse/ARROW-6636
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This means changing {{ARROW_BUILD_UTILITIES}} to be off by default. These 
> are mostly used for integration testing, so building unit or integration 
> tests should toggle this on automatically. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6636) [C++] Do not build C++ command line utilities by default

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6636:
-

Assignee: Antoine Pitrou

> [C++] Do not build C++ command line utilities by default
> 
>
> Key: ARROW-6636
> URL: https://issues.apache.org/jira/browse/ARROW-6636
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This means changing {{ARROW_BUILD_UTILITIES}} to be off by default. These 
> are mostly used for integration testing, so building unit or integration 
> tests should toggle this on automatically. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6635) [C++] Do not require glog for default build

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6635.
---
Resolution: Fixed

Issue resolved by pull request 5829
[https://github.com/apache/arrow/pull/5829]

> [C++] Do not require glog for default build
> ---
>
> Key: ARROW-6635
> URL: https://issues.apache.org/jira/browse/ARROW-6635
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We should change the default for {{ARROW_USE_GLOG}} to be off



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6635) [C++] Do not require glog for default build

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6635:
-

Assignee: Antoine Pitrou

> [C++] Do not require glog for default build
> ---
>
> Key: ARROW-6635
> URL: https://issues.apache.org/jira/browse/ARROW-6635
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We should change the default for {{ARROW_USE_GLOG}} to be off



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974511#comment-16974511
 ] 

Joris Van den Bossche commented on ARROW-7168:
--

[~buhrmann] thanks for the report. When passing a type like that, I agree it 
should be honoured.

Some other observations:

Also when it's not all-NaN, the specified type gets ignored:

{code}
In [19]: cat = pd.Categorical(['a', 'b']) 

In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), 
ordered=False)  

In [21]: pa.array(cat, type=typ) 
Out[21]: 


-- dictionary:
  [
"a",
"b"
  ]
-- indices:
  [
0,
1
  ]

In [22]: pa.array(cat, type=typ).type  
Out[22]: DictionaryType(dictionary)
{code}

So I suppose it's a more general problem, not specifically related to this 
all-NaN case (it only appears for you in this case, as otherwise the specified 
type and the type inferred from the data will probably match).

In the example I show above, we should probably raise an error if the 
specified type is not compatible (string vs int categories).

> [Python] pa.array() doesn't respect provided dictionary type with all NaNs
> --
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect specified dictionary type

2019-11-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7168:
-
Summary: [Python] pa.array() doesn't respect specified dictionary type  
(was: [Python] pa.array() doesn't respect provided dictionary type with all 
NaNs)

> [Python] pa.array() doesn't respect specified dictionary type
> -
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-7168:
-
Summary: [Python] pa.array() doesn't respect provided dictionary type with 
all NaNs  (was: pa.array() doesn't respect provided dictionary type with all 
NaNs)

> [Python] pa.array() doesn't respect provided dictionary type with all NaNs
> --
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7171) [Ruby] Pass Array for Arrow::Table#filter

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7171:
--
Labels: pull-request-available  (was: )

> [Ruby] Pass Array for Arrow::Table#filter
> --
>
> Key: ARROW-7171
> URL: https://issues.apache.org/jira/browse/ARROW-7171
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Ruby
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString

2019-11-14 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7172:
---

 Summary: [C++][Dataset] Improve format of Expression::ToString
 Key: ARROW-7172
 URL: https://issues.apache.org/jira/browse/ARROW-7172
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, C++ - Dataset
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read 
{{"b"_ > int32(3)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString

2019-11-14 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974491#comment-16974491
 ] 

Ben Kietzman commented on ARROW-7172:
-

[~npr]

> [C++][Dataset] Improve format of Expression::ToString
> -
>
> Key: ARROW-7172
> URL: https://issues.apache.org/jira/browse/ARROW-7172
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read 
> {{"b"_ > int32(3)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6967) [C++] Add filter expressions for IN, COALESCE, and DROP_NULL

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6967:
--
Labels: pull-request-available  (was: )

> [C++] Add filter expressions for IN, COALESCE, and DROP_NULL
> 
>
> Key: ARROW-6967
> URL: https://issues.apache.org/jira/browse/ARROW-6967
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Implement filter expressions for {{IN, COALESCE, DROP_NULL}}
> {{IN}} should be backed in TreeEvaluator by the IsIn kernel. The others 
> should be initially implemented in terms of logical and filter kernels until 
> a specialized kernel is added for those.
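
For reference, the intended semantics of the three expressions, sketched in plain Python (this shows only the contract, not the C++ kernel implementation):

{code:python}
def is_in(values, value_set):
    # null stays null; otherwise a membership test, as the IsIn kernel computes
    return [None if v is None else v in value_set for v in values]

def coalesce(*columns):
    # first non-null value per row, else null
    return [next((v for v in row if v is not None), None) for row in zip(*columns)]

def drop_null(values):
    return [v for v in values if v is not None]

col = [1, None, 3]
print(is_in(col, {1, 2}))        # [True, None, False]
print(coalesce(col, [9, 9, 9]))  # [1, 9, 3]
print(drop_null(col))            # [1, 3]
{code}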



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7171) [Ruby] Pass Array for Arrow::Table#filter

2019-11-14 Thread Yosuke Shiro (Jira)
Yosuke Shiro created ARROW-7171:
---

 Summary: [Ruby] Pass Array for Arrow::Table#filter
 Key: ARROW-7171
 URL: https://issues.apache.org/jira/browse/ARROW-7171
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2496) [C++] Add support for Libhdfs++

2019-11-14 Thread Deepak Majeti (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974463#comment-16974463
 ] 

Deepak Majeti commented on ARROW-2496:
--

Not anytime soon. We are still working on improving the robustness of Libhdfs++.

> [C++] Add support for Libhdfs++
> ---
>
> Key: ARROW-2496
> URL: https://issues.apache.org/jira/browse/ARROW-2496
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: HDFS
>
> Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
> project. Details are available here.
> https://issues.apache.org/jira/browse/HDFS-8707
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6749) [Python] Conversion of non-ns timestamp array to numpy gives wrong values

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6749.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5718
[https://github.com/apache/arrow/pull/5718]

> [Python] Conversion of non-ns timestamp array to numpy gives wrong values
> -
>
> Key: ARROW-6749
> URL: https://issues.apache.org/jira/browse/ARROW-6749
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> {code}
> In [25]: np_arr = np.arange("2012-01-01", "2012-01-06", int(1e6)*60*60*24, 
> dtype="datetime64[us]")   
>   
> In [26]: np_arr   
>   
>
> Out[26]: 
> array(['2012-01-01T00:00:00.00', '2012-01-02T00:00:00.00',
>'2012-01-03T00:00:00.00', '2012-01-04T00:00:00.00',
>'2012-01-05T00:00:00.00'], dtype='datetime64[us]')
> In [27]: arr = pa.array(np_arr)   
>   
>
> In [28]: arr  
>   
>
> Out[28]: 
> 
> [
>   2012-01-01 00:00:00.00,
>   2012-01-02 00:00:00.00,
>   2012-01-03 00:00:00.00,
>   2012-01-04 00:00:00.00,
>   2012-01-05 00:00:00.00
> ]
> In [29]: arr.type 
>   
>
> Out[29]: TimestampType(timestamp[us])
> In [30]: arr.to_numpy()   
>   
>
> Out[30]: 
> array(['1970-01-16T08:09:36.0', '1970-01-16T08:11:02.4',
>'1970-01-16T08:12:28.8', '1970-01-16T08:13:55.2',
>'1970-01-16T08:15:21.6'], dtype='datetime64[ns]')
> {code}
> So it seems to simply interpret the integer microsecond values as nanoseconds 
> when converting to numpy.
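
The arithmetic of the mix-up is easy to verify directly in numpy, using the same first value as the report above:

{code:python}
import numpy as np

us = np.int64(1325376000000000)  # 2012-01-01T00:00:00 in microseconds since epoch
print(np.datetime64(us, 'us'))   # 2012-01-01T00:00:00, the correct reading
print(np.datetime64(us, 'ns'))   # 1970-01-16T08:09:36, the reported bad output
{code}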



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6176:
--
Labels: pull-request-available  (was: )

> [Python] Allow to subclass ExtensionArray to attach to custom extension type
> 
>
> Key: ARROW-6176
> URL: https://issues.apache.org/jira/browse/ARROW-6176
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Currently, you can define a custom extension type in Python with 
> {code}
> class UuidType(pa.ExtensionType):
> def __init__(self):
> pa.ExtensionType.__init__(self, pa.binary(16))
> def __reduce__(self):
> return UuidType, ()
> {code}
> but the array you can create with this is always ExtensionArray. We should 
> provide a way to define a subclass (eg `UuidArray` in this case) that can 
> hold custom logic.
> For example, a user might want to define `UuidArray` such that `arr[i]` 
> returns an instance of Python's `uuid.UUID`
> From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691
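
For illustration, the requested subclass might look roughly like this (hypothetical: UuidArray and the ability to attach it to UuidType are exactly the missing feature; only the .storage attribute exists today):

{code:python}
import uuid
import pyarrow as pa

class UuidArray(pa.ExtensionArray):
    def __getitem__(self, i):
        value = self.storage[i].as_py()  # the underlying 16 raw bytes
        return None if value is None else uuid.UUID(bytes=value)
{code}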



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7168) pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Thomas Buhrmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974427#comment-16974427
 ] 

Thomas Buhrmann commented on ARROW-7168:


Since I'm already at it, and in case somebody faces the same problem... To 
safely convert pandas categoricals to arrow, ensuring a constant type across 
batches, something like the following would work:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

def categorical_to_arrow(ser, known_categories=None, ordered=None):
    """Safely create a pa.array from a categorical pd.Series.

    Args:
        ser (pd.Series): should be of CategoricalDtype
        known_categories (np.array): force known categories. If None, and
            the Series doesn't have any values to infer it from, will use
            an empty array of the same dtype as the categories attribute
            of the Series
        ordered (bool): whether categories should be ordered
    """
    n = len(ser)
    all_nan = ser.isna().sum() == n

    # Enforce provided categories, use the original ones, or enforce
    # the correct value_type if Arrow would otherwise change it to 'null'
    if isinstance(known_categories, np.ndarray):
        dictionary = known_categories
    elif all_nan:
        # value_type may be known, but Arrow doesn't understand 'object' dtype
        value_type = ser.cat.categories.dtype
        if value_type == 'object':
            value_type = 'str'
        dictionary = np.array([], dtype=value_type)
    else:
        dictionary = ser.cat.categories

    # Allow overwriting of ordered attribute
    if ordered is None:
        ordered = ser.cat.ordered

    if all_nan:
        return pa.DictionaryArray.from_arrays(
            indices=-np.ones(n, dtype=ser.cat.codes.dtype),
            dictionary=dictionary,
            mask=np.ones(n, dtype='bool'),
            ordered=ordered)
    else:
        return pa.DictionaryArray.from_arrays(
            indices=ser.cat.codes,
            dictionary=dictionary,
            ordered=ordered,
            from_pandas=True
        )
{code}
This seems to be the only(?) way to have control over the resulting 
dictionary type. E.g.:
{code:python}
# String categories with and without non-NaN values
sers = [
    pd.Series([None, None]).astype('object').astype('category'),
    pd.Series(['a', None, None]).astype('category')
]

# The categorical types we may want
known_categories = [
    None,
    np.array(['a', 'b', 'c'], dtype='str'),
    np.array([1, 2, 3], dtype='int8')
]

# Convert each series with each of the desired category types
for ser in sers:
    for cats in known_categories:
        arr = categorical_to_arrow(ser, known_categories=cats)
        ser2 = pd.Series(arr.to_pandas())
        print(f"Series: {list(ser)} | Known categories: {cats}")
        print(f"Dictionary type: {arr.type}")
        print(f"Roundtripped Series: \n{ser2}", "\n")
{code}
which produces:
{noformat}
Series: [nan, nan] | Known categories: None
Dictionary type: dictionary
Roundtripped Series: 
0NaN
1NaN
dtype: category
Categories (0, object): [] 

Series: [nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary
Roundtripped Series: 
0NaN
1NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: [nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary
Roundtripped Series: 
0NaN
1NaN
dtype: category
Categories (3, int64): [1, 2, 3] 

Series: ['a', nan, nan] | Known categories: None
Dictionary type: dictionary
Roundtripped Series: 
0  a
1NaN
2NaN
dtype: category
Categories (1, object): [a] 

Series: ['a', nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary
Roundtripped Series: 
0  a
1NaN
2NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: ['a', nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary
Roundtripped Series: 
0  1
1NaN
2NaN
dtype: category
Categories (3, int64): [1, 2, 3] 
{noformat}
(the last example would correspond to a recoding of the categories, but that'd 
be a usage problem...)

> pa.array() doesn't respect provided dictionary type with all NaNs
> -
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.15.1
>Reporter: Thomas Buhrmann
>Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial but easily occurs when processing 
> categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, 

[jira] [Updated] (ARROW-7170) [C++] Bundled ORC fails linking

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7170:
--
Labels: pull-request-available  (was: )

> [C++] Bundled ORC fails linking
> ---
>
> Key: ARROW-7170
> URL: https://issues.apache.org/jira/browse/ARROW-7170
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> This shows up when building the tests as well:
> {code}
> [1/2] Linking CXX executable debug/orc-adapter-test
> FAILED: debug/orc-adapter-test 
> : && /usr/bin/ccache /usr/bin/clang++-7  -Qunused-arguments 
> -fcolor-diagnostics -fuse-ld=gold -ggdb -O0  -Wall -Wextra -Wdocumentation
>  -Wno-unused-parameter -Wno-unknown-warning-option -Werror 
> -Wno-unknown-warning-option -msse4.2 -maltivec  -D_GLIBCXX_USE_CXX11_ABI=1 
> -D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g  -rdynamic 
> src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o  -o 
> debug/orc-adapter-test  
> -Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib
>  /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl 
> debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
> orc_ep-install/lib/liborc.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl 
> double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libssl.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a 
> /home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 
> /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm 
> -lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a 
> mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread 
> -lrt -Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && :
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284:
>  error: undefined reference to 'deflateInit2_'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232:
>  error: undefined reference to 'deflateReset'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254:
>  error: undefined reference to 'deflate'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291:
>  error: undefined reference to 'deflateEnd'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405:
>  error: undefined reference to 'inflateInit2_'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430:
>  error: undefined reference to 'inflateEnd'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471:
>  error: undefined reference to 'inflateReset'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477:
>  error: undefined reference to 'inflate'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820:
>  error: undefined reference to 'snappy::GetUncompressedLength(char const*, 
> unsigned long, unsigned long*)'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828:
>  error: undefined reference to 'snappy::RawUncompress(char const*, unsigned 
> long, char*)'
> /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894:
>  error: undefined reference to 'LZ4_decompress_safe'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7170) [C++] Bundled ORC fails linking

2019-11-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7170:
-

 Summary: [C++] Bundled ORC fails linking
 Key: ARROW-7170
 URL: https://issues.apache.org/jira/browse/ARROW-7170
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


This shows up when building the tests as well:
{code}
[1/2] Linking CXX executable debug/orc-adapter-test
FAILED: debug/orc-adapter-test 
: && /usr/bin/ccache /usr/bin/clang++-7  -Qunused-arguments -fcolor-diagnostics 
-fuse-ld=gold -ggdb -O0  -Wall -Wextra -Wdocumentation 
-Wno-unused-parameter -Wno-unknown-warning-option -Werror 
-Wno-unknown-warning-option -msse4.2 -maltivec  -D_GLIBCXX_USE_CXX11_ABI=1 
-D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g  -rdynamic 
src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o  -o 
debug/orc-adapter-test  
-Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib
 /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl 
debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
orc_ep-install/lib/liborc.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl 
double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libssl.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a 
/home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm 
-lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a 
mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread -lrt 
-Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && :
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284:
 error: undefined reference to 'deflateInit2_'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232:
 error: undefined reference to 'deflateReset'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254:
 error: undefined reference to 'deflate'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291:
 error: undefined reference to 'deflateEnd'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405:
 error: undefined reference to 'inflateInit2_'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430:
 error: undefined reference to 'inflateEnd'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471:
 error: undefined reference to 'inflateReset'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477:
 error: undefined reference to 'inflate'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820:
 error: undefined reference to 'snappy::GetUncompressedLength(char const*, 
unsigned long, unsigned long*)'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828:
 error: undefined reference to 'snappy::RawUncompress(char const*, unsigned 
long, char*)'
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894:
 error: undefined reference to 'LZ4_decompress_safe'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7169) [C++] Vendor uriparser library

2019-11-14 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7169:
-

 Summary: [C++] Vendor uriparser library
 Key: ARROW-7169
 URL: https://issues.apache.org/jira/browse/ARROW-7169
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou


The [uriparser C library|https://github.com/uriparser/uriparser]  is used 
internally for URI parsing. Instead of having an explicit dependency, we could 
simply vendor it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7169) [C++] Vendor uriparser library

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974301#comment-16974301
 ] 

Antoine Pitrou commented on ARROW-7169:
---

[~npr] This might ease the work for R a bit.

> [C++] Vendor uriparser library
> --
>
> Key: ARROW-7169
> URL: https://issues.apache.org/jira/browse/ARROW-7169
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> The [uriparser C library|https://github.com/uriparser/uriparser]  is used 
> internally for URI parsing. Instead of having an explicit dependency, we 
> could simply vendor it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6633) [C++] Do not require double-conversion for default build

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6633:
--
Labels: pull-request-available  (was: )

> [C++] Do not require double-conversion for default build
> 
>
> Key: ARROW-6633
> URL: https://issues.apache.org/jira/browse/ARROW-6633
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This library is only needed in core builds if
> * ARROW_JSON=on or
> * ARROW_CSV=on (option to be added) or
> * ARROW_BUILD_TESTS=on 
> The double conversion headers leak into 
> * arrow/util/decimal.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-7162:
-

Assignee: Antoine Pitrou

> [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
> ---
>
> Key: ARROW-7162
> URL: https://issues.apache.org/jira/browse/ARROW-7162
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> For clang we currently disable a lot of warnings explicitly. This dates back 
> to when we enabled {{-Weverything}}. We should probably remove most or all of 
> these flags now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-7162.
---
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5828
[https://github.com/apache/arrow/pull/5828]

> [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
> ---
>
> Key: ARROW-7162
> URL: https://issues.apache.org/jira/browse/ARROW-7162
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For clang we currently disable a lot of warnings explicitly. This dates back 
> to when we enabled {{-Weverything}}. We should probably remove most or all of 
> these flags now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6633) [C++] Do not require double-conversion for default build

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-6633:
-

Assignee: Antoine Pitrou

> [C++] Do not require double-conversion for default build
> 
>
> Key: ARROW-6633
> URL: https://issues.apache.org/jira/browse/ARROW-6633
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> This library is only needed in core builds if
> * ARROW_JSON=on or
> * ARROW_CSV=on (option to be added) or
> * ARROW_BUILD_TESTS=on 
> The double conversion headers leak into 
> * arrow/util/decimal.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6818) [Doc] Format docs confusing

2019-11-14 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974282#comment-16974282
 ] 

Wes McKinney commented on ARROW-6818:
-

I recently spent a lot of time on these documents. It's time for someone else 
to take a turn. 

> [Doc] Format docs confusing
> ---
>
> Key: ARROW-6818
> URL: https://issues.apache.org/jira/browse/ARROW-6818
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Major
>
> I find there are several issues in the format docs.
> 1) there is a claimed distinction between "logical types" and "physical 
> types", but the "physical types" actually lists logical types such as Map
> 2) the "logical types" document doesn't actually list logical types, it just 
> sends to the flatbuffers file. One shouldn't have to read a flatbuffers file 
> to understand the Arrow format.
> 3) some terminology seems unusual, such as "relative type"
> 4) why is there a link to the Apache Drill docs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2019-11-14 Thread SURESH CHAGANTI (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974270#comment-16974270
 ] 

SURESH CHAGANTI commented on ARROW-4890:


Thank you [~emkornfi...@gmail.com], I appreciate your time. I ran multiple 
tests with different sizes of data, and the shard larger than 2GB failed. 
Glad to see the fix is in progress; it would be great if this change rolls 
out soon. I will be happy to contribute to this issue.

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>Reporter: Abdeali Kothari
>Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in Arrow project as the traceback seems to suggest this is an 
> issue in Arrow.
>  Continuation from the conversation on the mailing list: 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in 
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as my dataset size starts increasing that I want to group on. Here is a 
> reproducible code snippet where I can reproduce this.
>  Note: My actual dataset is much larger and has many more unique IDs and is a 
> valid usecase where I cannot simplify this groupby in any way. I have 
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>   columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
>   columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
> return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per 
> executor too.
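
A back-of-envelope check on the repro: every row shares df1_c1 == 1234567, so the GROUPED_MAP UDF receives a single giant group. The row counts come from the snippet; the per-row byte estimate and the 2 GiB threshold (signed 32-bit lengths in the IPC path) are assumptions, not a confirmed root cause.

{code:python}
rows = 429 * 48993                      # inner join on a single key value
print(rows)                             # 21017997 rows in one group
approx_row_bytes = 100                  # rough guess for the ~10 columns
print(rows * approx_row_bytes / 2**30)  # ~1.96, i.e. right around 2 GiB
{code}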



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6636) [C++] Do not build C++ command line utilities by default

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6636:
--
Labels: pull-request-available  (was: )

> [C++] Do not build C++ command line utilities by default
> 
>
> Key: ARROW-6636
> URL: https://issues.apache.org/jira/browse/ARROW-6636
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This means changing {{ARROW_BUILD_UTILITIES}} to be off by default. These 
> are mostly used for integration testing, so building unit or integration 
> tests should toggle this on automatically. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6635) [C++] Do not require glog for default build

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6635:
--
Labels: pull-request-available  (was: )

> [C++] Do not require glog for default build
> ---
>
> Key: ARROW-6635
> URL: https://issues.apache.org/jira/browse/ARROW-6635
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We should change the default for {{ARROW_USE_GLOG}} to be off



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974243#comment-16974243
 ] 

Antoine Pitrou commented on ARROW-2196:
---

Is there something left to do here?

> [C++] Consider quarantining platform code with dependency on non-header Boost 
> code
> --
>
> Key: ARROW-2196
> URL: https://issues.apache.org/jira/browse/ARROW-2196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> see discussion in ARROW-2193 for the motivation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2078) [C++] Consider making DataType and its subclasses noncopyable

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974242#comment-16974242
 ] 

Antoine Pitrou commented on ARROW-2078:
---

Actually, we already have:
{code:c++}
  ARROW_DISALLOW_COPY_AND_ASSIGN(DataType);
{code}

Does this issue allude to something else? What exactly?

> [C++] Consider making DataType and its subclasses noncopyable
> -
>
> Key: ARROW-2078
> URL: https://issues.apache.org/jira/browse/ARROW-2078
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6257) [C++] Add fnmatch compatible globbing function

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6257.
---
  Assignee: (was: Ben Kietzman)
Resolution: Won't Fix

> [C++] Add fnmatch compatible globbing function
> --
>
> Key: ARROW-6257
> URL: https://issues.apache.org/jira/browse/ARROW-6257
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> This will be useful for the filesystems module and in datasource discovery, 
> which uses it.
> Behavior should be compatible with 
> http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html
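
For a concrete reference point, Python's stdlib fnmatch implements the same POSIX-style semantics (shown here only to pin down the expected behavior, not as the C++ API):

{code:python}
import fnmatch

print(fnmatch.fnmatch('data/part-0.parquet', 'data/*.parquet'))  # True
print(fnmatch.fnmatch('data/part-0.csv', 'data/*.parquet'))      # False
print(fnmatch.fnmatch('part-7.parquet', 'part-?.parquet'))       # True
{code}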



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6320) [C++] Arrow utilities are linked statically

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-6320.
-
Resolution: Done

This seems to have been fixed at some point.

> [C++] Arrow utilities are linked statically
> ---
>
> Key: ARROW-6320
> URL: https://issues.apache.org/jira/browse/ARROW-6320
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> Even though other executables are linked dynamically with {{libarrow}} and 
> friends, the arrow utilities are linked statically on Linux:
> {code}
> $ ldd build-test/debug/arrow-stream-to-file 
>   linux-vdso.so.1 (0x7ffe353a8000)
>   libboost_filesystem.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 
> (0x7f7baf7a1000)
>   libboost_system.so.1.67.0 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.67.0 
> (0x7f7baf59c000)
>   libstdc++.so.6 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f7bb0522000)
>   libgcc_s.so.1 => 
> /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f7bb050e000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f7baf37d000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f7baef8c000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f7bb0471000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f7baed84000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f7bae9e6000)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4799) [C++] Propose alternative strategy for handling Operation logical output types

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974235#comment-16974235
 ] 

Antoine Pitrou commented on ARROW-4799:
---

Well, neither does the Expr / Operation hierarchy in compute, AFAICT.

> [C++] Propose alternative strategy for handling Operation logical output types
> --
>
> Key: ARROW-4799
> URL: https://issues.apache.org/jira/browse/ARROW-4799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Currently in the prototype work in ARROW-4782, operations are being "boxed" 
> in strongly typed Expr types. An alternative structure would be for an 
> operation to define a virtual
> {code}
> virtual std::shared_ptr<ArgType> out_type() const = 0;
> {code}
> Where {{ArgType}} is some class that encodes the arity (array vs. scalar 
> vs) and value type (if any) that is emitted by the operation.
> Operations emitting multiple pieces of data would need some kind of "tuple" 
> object output. We can iterate on this



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6257) [C++] Add fnmatch compatible globbing function

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974232#comment-16974232
 ] 

Antoine Pitrou commented on ARROW-6257:
---

Is this still relevant? I don't think being POSIX-fnmatch-compatible is 
important here. [~fsaintjacques]

> [C++] Add fnmatch compatible globbing function
> --
>
> Key: ARROW-6257
> URL: https://issues.apache.org/jira/browse/ARROW-6257
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> This will be useful for the filesystems module and in datasource discovery, 
> which uses it.
> Behavior should be compatible with 
> http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4799) [C++] Propose alternative strategy for handling Operation logical output types

2019-11-14 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974231#comment-16974231
 ] 

Francois Saint-Jacques commented on ARROW-4799:
---

This is still relevant. The expression class in dataset does not represent 
logical (relational) operations.

> [C++] Propose alternative strategy for handling Operation logical output types
> --
>
> Key: ARROW-4799
> URL: https://issues.apache.org/jira/browse/ARROW-4799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Currently in the prototype work in ARROW-4782, operations are being "boxed" 
> in strongly typed Expr types. An alternative structure would be for an 
> operation to define a virtual
> {code}
> virtual std::shared_ptr<ArgType> out_type() const = 0;
> {code}
> Where {{ArgType}} is some class that encodes the arity (array vs. scalar 
> vs) and value type (if any) that is emitted by the operation.
> Operations emitting multiple pieces of data would need some kind of "tuple" 
> object output. We can iterate on this



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6612) [C++] Add ARROW_CSV CMake build flag

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974230#comment-16974230
 ] 

Antoine Pitrou commented on ARROW-6612:
---

I'd rather close this as won't fix. Our build system is already too convoluted.

> [C++] Add ARROW_CSV CMake build flag
> 
>
> Key: ARROW-6612
> URL: https://issues.apache.org/jira/browse/ARROW-6612
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> I think it would be better not to build this part of the project 
> unconditionally



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974229#comment-16974229
 ] 

Antoine Pitrou commented on ARROW-1292:
---

Is this still relevant? Does it make sense to focus efforts on HDFS?

> [C++/Python] Expand libhdfs feature coverage
> 
>
> Key: ARROW-1292
> URL: https://issues.apache.org/jira/browse/ARROW-1292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 1.0.0
>
>
> Umbrella JIRA. Will create child issues for more granular tasks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6615) [C++] Add filtering option to fs::Selector

2019-11-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques resolved ARROW-6615.
---
Resolution: Won't Fix

> [C++] Add filtering option to fs::Selector
> --
>
> Key: ARROW-6615
> URL: https://issues.apache.org/jira/browse/ARROW-6615
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It would be convenient if Selector could support file path filtering, either 
> via a regex or globbing applied to the path.
> This is semi-required for filtering files in Dataset to properly apply the 
> file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-1403) [C++] Add variant of SerializeRecordBatch that accepts a writer-allocator callback

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-1403.
-
Resolution: Abandoned

Closing as outdated.

> [C++] Add variant of SerializeRecordBatch that accepts a writer-allocator 
> callback
> ---
>
> Key: ARROW-1403
> URL: https://issues.apache.org/jira/browse/ARROW-1403
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> When writing to other kinds of interfaces, like GPU, it would be useful to be 
> able to pass a function or closure that can instantiate an instance of 
> {{OutputStream}} (which might write to CPU memory, GPU, etc.) given the 
> computed size of the record batch. Currently we allocate new CPU memory and 
> write to that buffer, but this would eliminate an intermediate copy.
> So something like
> {code}
> typedef std::function*)> 
> StreamCreator;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-1445) [Python] Segfault when using libhdfs3 in pyarrow using latest API

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-1445.
-
Resolution: Abandoned

Closing as outdated. Feel free to open a new issue if you still experience this.

> [Python] Segfault when using libhdfs3 in pyarrow using latest API
> -
>
> Key: ARROW-1445
> URL: https://issues.apache.org/jira/browse/ARROW-1445
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.6.0
>Reporter: James Porritt
>Priority: Major
>
> I'm encoutering a segfault when using libhdfs3 with pyarrow.
> My script is:
> {code}
> import pyarrow
> def main():
>     hdfs = pyarrow.hdfs.connect("<host>", <port>, "<user>", 
> driver='libhdfs')
>     print hdfs.ls('<path>')
>     hdfs3a = pyarrow.HdfsClient("<host>", <port>, "<user>", 
> driver='libhdfs3')
>     print hdfs3a.ls('<path>')
>     hdfs3b = pyarrow.hdfs.connect("<host>", <port>, "<user>", 
> driver='libhdfs3')
>     print hdfs3b.ls('<path>')
> main()
> {code}
> The first two hdfs connections yield the correct list. The third yields:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f69c0c8b57f, pid=88070, tid=140092200666880
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # C  [libc.so.6+0x13357f]  __strlen_sse42+0xf
> {noformat}
> It dumps an error report file too.
> I created my conda environment with:
> {noformat}
> conda create -n parquet
> source activate parquet
> conda install pyarrow libhdfs3 -c conda-forge
> {noformat}
> The packages used are:
> {noformat}
> arrow-cpp 0.6.0   np113py27_1conda-forge
> boost-cpp 1.64.01conda-forge
> bzip2 1.0.6 1conda-forge
> ca-certificates   2017.7.27.1   0conda-forge
> certifi   2017.7.27.1  py27_0conda-forge
> curl  7.54.10conda-forge
> icu   58.1  1conda-forge
> krb5  1.14.20conda-forge
> libgcrypt 1.8.0 0conda-forge
> libgpg-error  1.27  0conda-forge
> libgsasl  1.8.0 1conda-forge
> libhdfs3  2.3   0conda-forge
> libiconv  1.14  4conda-forge
> libntlm   1.4   0conda-forge
> libssh2   1.8.0 1conda-forge
> libuuid   1.0.3 1conda-forge
> libxml2   2.9.4 4conda-forge
> mkl   2017.0.3  0  
> ncurses   5.9  10conda-forge
> numpy 1.13.1   py27_0  
> openssl   1.0.2l0conda-forge
> pandas0.20.3   py27_1conda-forge
> parquet-cpp   1.3.0.pre 1conda-forge
> pip   9.0.1py27_0conda-forge
> protobuf  3.3.2py27_0conda-forge
> pyarrow   0.6.0   np113py27_1conda-forge
> python2.7.131conda-forge
> python-dateutil   2.6.1py27_0conda-forge
> pytz  2017.2   py27_0conda-forge
> readline  6.2   0conda-forge
> setuptools36.2.2   py27_0conda-forge
> six   1.10.0   py27_1conda-forge
> sqlite3.13.01conda-forge
> tk8.5.192conda-forge
> wheel 0.29.0   py27_0conda-forge
> xz5.2.3 0conda-forge
> zlib  1.2.110conda-forge
> {noformat}
> I've set my ARROW_LIBHDFS_DIR to point at the location of the libhdfs3.so 
> file.
> I've populated my CLASSPATH as per the documentation.
> Please advise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6818) [Doc] Format docs confusing

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974217#comment-16974217
 ] 

Antoine Pitrou commented on ARROW-6818:
---

cc [~wesm] [~emkornfield]

> [Doc] Format docs confusing
> ---
>
> Key: ARROW-6818
> URL: https://issues.apache.org/jira/browse/ARROW-6818
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Major
>
> I find there are several issues in the format docs.
> 1) there is a claimed distinction between "logical types" and "physical 
> types", but the "physical types" actually lists logical types such as Map
> 2) the "logical types" document doesn't actually list logical types, it just 
> sends to the flatbuffers file. One shouldn't have to read a flatbuffers file 
> to understand the Arrow format.
> 3) some terminology seems unusual, such as "relative type"
> 4) why is there a link to the Apache Drill docs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6511) [Developer] Remove run_docker_compose.sh

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-6511.
-
Resolution: Done

It seems the script was removed in the meantime.

> [Developer] Remove run_docker_compose.sh
> 
>
> Key: ARROW-6511
> URL: https://issues.apache.org/jira/browse/ARROW-6511
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Ben Kietzman
>Priority: Trivial
>
> dev/run_docker_compose.sh and Makefile.docker perform fundamentally the same 
> function: run docker-compose conveniently. Consolidating them is probably 
> worthwhile, and since Makefile.docker also builds dependencies it seems the 
> natural choice.
> Update dev/README.md as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-5431) [C++] Protobuf fails building on Windows

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-5431.
-
Resolution: Abandoned

> [C++] Protobuf fails building on Windows
> 
>
> Key: ARROW-5431
> URL: https://issues.apache.org/jira/browse/ARROW-5431
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> Looks like this part of our build chain assumes Unix:
> {code}
> [55/489] Performing configure step for 'protobuf_ep'
> FAILED: protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-configure
> cmd.exe /C "cd /D 
> C:\t\arrow\cpp\build-debug\protobuf_ep-prefix\src\protobuf_ep
> && C:\Miniconda3\envs\arrow\Library\bin\cmake.exe -P 
> C:/t/arrow/cpp/build-debug/
> protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-configure-DEBUG.cmake && 
> C:
> \Miniconda3\envs\arrow\Library\bin\cmake.exe -E touch 
> C:/t/arrow/cpp/build-debug
> /protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-configure"
> CMake Error at 
> C:/t/arrow/cpp/build-debug/protobuf_ep-prefix/src/protobuf_ep-sta
> mp/protobuf_ep-configure-DEBUG.cmake:49 (message):
>   Command failed: %1 is not a valid Win32 application
>'./configure' 'AR=' 'RANLIB=' 'CC=C:/Program Files (x86)/Microsoft Visual 
> Stu
> dio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx64/x64/cl.exe' 
> 'CXX=C:/Min
> iconda3/envs/arrow/Scripts/clcache.exe' '--disable-shared' 
> '--prefix=C:/t/arrow/
> cpp/thirdparty/protobuf_ep-install' 'CFLAGS=/DWIN32 /D_WINDOWS /W3  /MDd /Zi 
> /Ob
> 0 /Od /RTC1' 'CXXFLAGS=/DWIN32 /D_WINDOWS  /GR /EHsc 
> /D_SILENCE_TR1_NAMESPACE_DE
> PRECATION_WARNING  /MDd /Zi /Ob0 /Od /RTC1'
>   See also
> 
> C:/t/arrow/cpp/build-debug/protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf
> _ep-configure-*.log
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4799) [C++] Propose alternative strategy for handling Operation logical output types

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974205#comment-16974205
 ] 

Antoine Pitrou commented on ARROW-4799:
---

Is this still relevant? It seems like the expression layer from datasets should 
be used instead.
cc [~fsaintjacques]

> [C++] Propose alternative strategy for handling Operation logical output types
> --
>
> Key: ARROW-4799
> URL: https://issues.apache.org/jira/browse/ARROW-4799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Currently in the prototype work in ARROW-4782, operations are being "boxed" 
> in strongly typed Expr types. An alternative structure would be for an 
> operation to define a virtual
> {code}
> virtual std::shared_ptr<ArgType> out_type() const = 0;
> {code}
> Where {{ArgType}} is some class that encodes the arity (array vs. scalar 
> vs) and value type (if any) that is emitted by the operation.
> Operations emitting multiple pieces of data would need some kind of "tuple" 
> object output. We can iterate on this



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7168) pa.array() doesn't respect provided dictionary type with all NaNs

2019-11-14 Thread Thomas Buhrmann (Jira)
Thomas Buhrmann created ARROW-7168:
--

 Summary: pa.array() doesn't respect provided dictionary type with 
all NaNs
 Key: ARROW-7168
 URL: https://issues.apache.org/jira/browse/ARROW-7168
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.15.1
Reporter: Thomas Buhrmann


This might be related to ARROW-6548 and others dealing with all NaN columns. 
When creating a dictionary array, even when fully specifying the desired type, 
this type is not respected when the data contains only NaNs:


{code:python}
# This may look a little artificial but easily occurs when processing 
categorical data in batches and a particular batch contains only NaNs
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type
{code}

results in

{noformat}
>> DictionaryType(dictionary)
{noformat}

which means that one cannot e.g. serialize batches of categoricals if the 
possibility of all-NaN batches exists, even when trying to enforce that each 
batch has the same schema (because the schema is not respected).

I understand that inferring the type in this case would be difficult, but I'd 
imagine that a fully specified type should be respected in this case?

In the meantime, is there a workaround to manually create a dictionary array of 
the desired type containing only NaNs?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6003) [C++] Better input validation and error messaging in CSV reader

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6003:
--
Component/s: R

> [C++] Better input validation and error messaging in CSV reader
> ---
>
> Key: ARROW-6003
> URL: https://issues.apache.org/jira/browse/ARROW-6003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: csv
>
> Followup to https://issues.apache.org/jira/browse/ARROW-5747. The error 
> message(s) are not great when you give bad input. For example, if I give too 
> many or too few {{column_names}}, the error I get is {{Invalid: Empty CSV 
> file}}. In fact, that's about the only error message I've seen from the CSV 
> reader, no matter what I've thrown at it.
> It would be better if error messages were more specific so that I as a user 
> might know how to fix my bad input.
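
A minimal pyarrow repro of the kind of bad input described (read_csv and ReadOptions are the real API; that the message doesn't mention column_names at all is the complaint):

{code:python}
import io
import pyarrow.csv as csv

data = io.BytesIO(b"a,b\n1,2\n")
opts = csv.ReadOptions(column_names=["only_one"])  # file actually has 2 columns
try:
    csv.read_csv(data, read_options=opts)
except Exception as exc:
    print(exc)  # reported above as just "Invalid: Empty CSV file"
{code}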



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1099) [C++] Add support for PFOR integer compression

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974184#comment-16974184
 ] 

Antoine Pitrou commented on ARROW-1099:
---

Do we want to keep this open?

> [C++] Add support for PFOR integer compression
> --
>
> Key: ARROW-1099
> URL: https://issues.apache.org/jira/browse/ARROW-1099
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> https://github.com/lemire/FastPFor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6615) [C++] Add filtering option to fs::Selector

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974180#comment-16974180
 ] 

Antoine Pitrou commented on ARROW-6615:
---

Can we close this?

> [C++] Add filtering option to fs::Selector
> --
>
> Key: ARROW-6615
> URL: https://issues.apache.org/jira/browse/ARROW-6615
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> It would be convenient if Selector could support file path filtering, either 
> via a regex or globbing applied to the path.
> This is semi-required for filtering files in Dataset to properly apply the 
> file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2496) [C++] Add support for Libhdfs++

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974181#comment-16974181
 ] 

Antoine Pitrou commented on ARROW-2496:
---

Are you still willing to work on this?

> [C++] Add support for Libhdfs++
> ---
>
> Key: ARROW-2496
> URL: https://issues.apache.org/jira/browse/ARROW-2496
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: HDFS
>
> Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS 
> project. Details are available here.
> https://issues.apache.org/jira/browse/HDFS-8707
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4763) [C++/Python] Cannot build Gandiva in conda on OSX due to package conflicts

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974178#comment-16974178
 ] 

Antoine Pitrou commented on ARROW-4763:
---

Is this still relevant?

> [C++/Python] Cannot build Gandiva in conda on OSX due to package conflicts
> --
>
> Key: ARROW-4763
> URL: https://issues.apache.org/jira/browse/ARROW-4763
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, C++ - Gandiva, Python
>Reporter: Uwe Korn
>Priority: Major
>
> It is currently not reliably possible to build Gandiva with one of the conda 
> toolchains on OSX as the packages {{llvm==4.0.1}} (pulled in by the 
> compilers) and {{llvmdev==7.0.1}} conflict in some files: 
> https://github.com/conda-forge/llvmdev-feedstock/issues/60



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6972) [C#] Should support StructField arrays

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6972:
--
Summary: [C#] Should support StructField arrays  (was: C# should support 
StructField arrays)

> [C#] Should support StructField arrays
> --
>
> Key: ARROW-6972
> URL: https://issues.apache.org/jira/browse/ARROW-6972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Cameron Murray
>Priority: Major
>
> The C# implementation of Arrow does not support struct arrays and, more 
> generally, complex types.
>  I notice ARROW-6870 addresses Dictionary arrays; however, this is not as 
> flexible as structs (for example, it cannot mix data types).
> The source does have a stub for StructArray, but there is no Builder nor an 
> example of how to use it, so I assume it is not supported.
>  
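
For comparison, struct array construction in pyarrow, which is roughly the 
kind of API being asked for here (a sketch for illustration, not a proposal 
for the C# surface):

{code:python}
import pyarrow as pa

# A struct column mixing child types, which dictionary arrays cannot express.
arr = pa.StructArray.from_arrays(
    [pa.array([1, 2]), pa.array(["a", "b"])],
    names=["id", "label"],
)
print(arr.type)  # struct<id: int64, label: string>
{code}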



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-1921) [Doc] Build API docs on a per-release basis

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-1921:
--
Summary: [Doc] Build API docs on a per-release basis  (was: Build API docs 
on a per-release basis)

> [Doc] Build API docs on a per-release basis
> ---
>
> Key: ARROW-1921
> URL: https://issues.apache.org/jira/browse/ARROW-1921
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Uwe Korn
>Priority: Major
>
> Currently we build the docs from time to time manually from master. We should 
> also build them per release so that you can have a look at the latest 
> released API version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-851) C++/Python: Check Boost/Arrow C++ABI for consistency

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974176#comment-16974176
 ] 

Antoine Pitrou commented on ARROW-851:
--

Is this issue still relevant?

> C++/Python: Check Boost/Arrow C++ABI for consistency
> 
>
> Key: ARROW-851
> URL: https://issues.apache.org/jira/browse/ARROW-851
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Uwe Korn
>Priority: Major
>
> When building with dependencies from conda-forge on a newer system with GCC, 
> the C++ ABI versions can differ. We need to ensure that the versions match 
> between Boost, arrow-cpp and pyarrow in our CMake scripts.
> Depending on this, we may need to pass {{-D_GLIBCXX_USE_CXX11_ABI=0}} to 
> {{CMAKE_CXX_FLAGS}}.
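
A sketch of the manual workaround such a check would automate, assuming the 
conda-forge toolchain packages were built against the old ABI (whether the 
flag is needed depends on how Boost was actually compiled):

{code:java}
# Force the pre-C++11 libstdc++ ABI to match the prebuilt binaries.
cmake .. -DCMAKE_CXX_FLAGS="-D_GLIBCXX_USE_CXX11_ABI=0"
{code}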



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4429) [Doc] Add git rebase tips to the 'Contributing' page in the developer docs

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4429:
--
Summary: [Doc] Add git rebase tips to the 'Contributing' page in the 
developer docs  (was: Add git rebase tips to the 'Contributing' page in the 
developer docs)

> [Doc] Add git rebase tips to the 'Contributing' page in the developer docs
> --
>
> Key: ARROW-4429
> URL: https://issues.apache.org/jira/browse/ARROW-4429
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Tanya Schlusser
>Priority: Major
>
> A recent discussion on the listserv (link below) asked about how contributors 
> should handle rebasing. It would be helpful if the tips made it into the 
> developer documentation somehow. I suggest in the ["Contributing to Apache 
> Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow]
>  page—currently a wiki, but hopefully eventually part of the Sphinx docs 
> ARROW-4427.
> Here is the relevant thread:
> [https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E]
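
For what it's worth, the usual flow such tips would cover (a sketch assuming 
the conventional "upstream" remote name and a hypothetical feature branch):

{code:java}
# Replay the feature branch on top of the latest upstream master,
# then update the already-pushed PR branch.
git fetch upstream
git rebase upstream/master
git push --force-with-lease origin my-feature-branch
{code}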



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5067) [Doc] Adding support for versioned documentation

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5067:
--
Summary: [Doc] Adding support for versioned documentation  (was: Adding 
support for versioned documentation)

> [Doc] Adding support for versioned documentation
> 
>
> Key: ARROW-5067
> URL: https://issues.apache.org/jira/browse/ARROW-5067
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 0.9.0, 0.12.1
>Reporter: Helen Ngo
>Priority: Minor
>  Labels: beginner, documentation, newbie, starter
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Per this GitHub issue: [https://github.com/apache/arrow/issues/4016]
> *The request:* API docs for previous PyArrow versions as a drop-down link in 
> the sidebar of this page: [https://arrow.apache.org/docs/python/api.html]
> I often work in multiple environments with PyArrow versions ranging from 
> 0.8.0 to 0.12+, and the core Python API has changed quite a bit 
> version-to-version, so code occasionally no longer works. It can be difficult 
> to hunt down the corresponding changes in each version. Versioned docs 
> similar to pandas and other popular libraries would be excellent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6389) java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6389:
--
Priority: Major  (was: Blocker)

> java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
> 
>
> Key: ARROW-6389
> URL: https://issues.apache.org/jira/browse/ARROW-6389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java, Python
>Affects Versions: 0.14.1
> Environment: Hadoop 2.85
> EMR 5.24.1
> python version: 3.7.4
> skein version: 0.8.0
>Reporter: Ben Schreck
>Priority: Major
>
> I can't access hdfs through pyarrow (from inside a yarn container created by 
> skein).
> This code works in a jupyter notebook running on the master node, or in an 
> ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:
> {{import pyarrow; pyarrow.hdfs.connect()}}
>  
> However, when running on yarn by submitting the following skein application, 
> I get a Java error.
>  
> {{name: test_conn
> queue: default
> master:
>   env:
> ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
> JAVA_HOME: /etc/alternatives/jre
>   resources:
> vcores: 1
> memory: 10 GiB
>   files:
> conda_env: /home/hadoop/environment.tar.gz
>   script: |
> echo $HADOOP_HOME
> echo $JAVA_HOME
> echo $HADOOP_CLASSPATH
> echo $ARROW_LIBHDFS_DIR
> source conda_env/bin/activate
> python -c "import pyarrow; pyarrow.hdfs.connect(); 
> print(fs.open('test.txt').read())"
> echo "Hello World!"}}
> FYI I tried with/without all those extra env vars, to no effect. I also tried 
> modifying the EMR cluster with any of the following:
>  
> {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
> "fs.AbstractFileSystem.hdfs.impl": 
> "org.apache.hadoop.hdfs.DistributedFileSystem"
> "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}
> The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: 
> it was able to find which class name to use for the "hdfs://" prefix, 
> namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but was not able to 
> find that class.
> Logs:
>  
> {{=
> LogType:application.driver.log
> Log Upload Time:Thu Aug 29 20:51:59 + 2019
> LogLength:2635
> Log Contents:
> /usr/lib/hadoop
> /usr/lib/jvm/java-openjdk
> :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
> hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, 
> kerbTicketCachePath=(NULL), userName=(NULL)) error:
> java.io.IOException: No FileSystem for scheme: hdfs
> at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
> at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
> at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
> at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
> extra_conf=extra_conf)
>   File 
> "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  

[jira] [Updated] (ARROW-6389) [Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6389:
--
Summary: [Python] java.io.IOException: No FileSystem for scheme: hdfs [On 
AWS EMR]  (was: java.io.IOException: No FileSystem for scheme: hdfs [On AWS 
EMR])

> [Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
> -
>
> Key: ARROW-6389
> URL: https://issues.apache.org/jira/browse/ARROW-6389
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java, Python
>Affects Versions: 0.14.1
> Environment: Hadoop 2.85
> EMR 5.24.1
> python version: 3.7.4
> skein version: 0.8.0
>Reporter: Ben Schreck
>Priority: Major
>
> I can't access hdfs through pyarrow (from inside a yarn container created by 
> skein).
> This code works in a jupyter notebook running on the master node, or in an 
> ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:
> {{import pyarrow; pyarrow.hdfs.connect()}}
>  
> However, when running on yarn by submitting the following skein application, 
> I get a Java error.
>  
> {{name: test_conn
> queue: default
> master:
>   env:
> ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
> JAVA_HOME: /etc/alternatives/jre
>   resources:
> vcores: 1
> memory: 10 GiB
>   files:
> conda_env: /home/hadoop/environment.tar.gz
>   script: |
> echo $HADOOP_HOME
> echo $JAVA_HOME
> echo $HADOOP_CLASSPATH
> echo $ARROW_LIBHDFS_DIR
> source conda_env/bin/activate
> python -c "import pyarrow; pyarrow.hdfs.connect(); 
> print(fs.open('test.txt').read())"
> echo "Hello World!"}}
> FYI I tried with/without all those extra env vars, to no effect. I also tried 
> modifying the EMR cluster with any of the following:
>  
> {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
> "fs.AbstractFileSystem.hdfs.impl": 
> "org.apache.hadoop.hdfs.DistributedFileSystem"
> "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}
> The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: 
> it was able to find which class name to use for the "hdfs://" prefix, 
> namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but was not able to 
> find that class.
> Logs:
>  
> {{=
> LogType:application.driver.log
> Log Upload Time:Thu Aug 29 20:51:59 + 2019
> LogLength:2635
> Log Contents:
> /usr/lib/hadoop
> /usr/lib/jvm/java-openjdk
> :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
> hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, 
> kerbTicketCachePath=(NULL), userName=(NULL)) error:
> java.io.IOException: No FileSystem for scheme: hdfs
> at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
> at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
> at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
> at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py",
>  line 215, in connect
> extra_conf=extra_conf)
>   File 
> 

[jira] [Commented] (ARROW-6361) [Java] sbt docker publish fails due to Arrow dependencies

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974175#comment-16974175
 ] 

Antoine Pitrou commented on ARROW-6361:
---

[~emkornfield]

> [Java] sbt docker publish fails due to Arrow dependencies
> 
>
> Key: ARROW-6361
> URL: https://issues.apache.org/jira/browse/ARROW-6361
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Boris V.Kuznetsov
>Priority: Major
> Attachments: tree.txt
>
>
> Hello guys
> I'm using Arrow in my Scala project and included Maven deps in sbt as 
> required.
> However, when I try to publish a Docker container with sbt 'docker:publish', 
> I get the following error:
> [error] 1 error was encountered during merge
> [error] java.lang.RuntimeException: deduplicate: different file contents 
> found in the following:
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-vector/0.14.1/arrow-vector-0.14.1.jar:git.properties
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-format/0.14.1/arrow-format-0.14.1.jar:git.properties
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-memory/0.14.1/arrow-memory-0.14.1.jar:git.properties
> My project is [here|https://github.com/Clover-Group/tsp/tree/kafka].
> You may check the project dependency tree attached.
> How do I fix that?
> Thank you
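
If the merge step is being done by sbt-assembly, one commonly used workaround 
is a merge strategy that keeps a single {{git.properties}} (a sketch, untested 
against this particular project):

{code:java}
// In build.sbt: resolve the duplicate git.properties entries by keeping
// the first one encountered; defer to the default strategy for the rest.
assemblyMergeStrategy in assembly := {
  case "git.properties" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
{code}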



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6361) [Java] sbt docker publish fails due to Arrow dependencies

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6361:
--
Summary: [Java] sbt docker publish fails due to Arrow dependencies  (was: 
sbt docker publish fails due to Arrow dependencies)

> [Java] sbt docker publish fails due to Arrow dependencies
> 
>
> Key: ARROW-6361
> URL: https://issues.apache.org/jira/browse/ARROW-6361
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Boris V.Kuznetsov
>Priority: Major
> Attachments: tree.txt
>
>
> Hello guys
> I'm using Arrow in my Scala project and included Maven deps in sbt as 
> required.
> However, when I try to publish a Docker container with sbt 'docker:publish', 
> I get the following error:
> [error] 1 error was encountered during merge
> [error] java.lang.RuntimeException: deduplicate: different file contents 
> found in the following:
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-vector/0.14.1/arrow-vector-0.14.1.jar:git.properties
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-format/0.14.1/arrow-format-0.14.1.jar:git.properties
> [error] 
> /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-memory/0.14.1/arrow-memory-0.14.1.jar:git.properties
> My project is [here|https://github.com/Clover-Group/tsp/tree/kafka].
> You may check the project dependency tree attached.
> How do I fix that?
> Thank you



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7158) [C++][Visual Studio] Build config error on non-English version of Visual Studio

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974174#comment-16974174
 ] 

Antoine Pitrou commented on ARROW-7158:
---

cc [~kou]

> [C++][Visual Studio] Build config error on non-English version of Visual Studio
> 
>
> Key: ARROW-7158
> URL: https://issues.apache.org/jira/browse/ARROW-7158
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Yiun Seungryong
>Priority: Minor
>
> * The build configuration fails on a non-English OS.
>  * It always shows:
> {code:java}
>  Not supported MSVC compiler  {code}
>  * 
> [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44]
> There is a bug in the code below.
> {code:java}
> if(MSVC)
>  set(COMPILER_FAMILY "msvc")
>  if("${COMPILER_VERSION_FULL}" MATCHES
>  ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code}
>  * In my compiler the version display contains Korean ("최적화 컴파일러 버전" 
> translates to "Optimizing Compiler Version"):
> {code:java}
> Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code}
>  * The regular expression seems to need to be changed.
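
A sketch of a more locale-tolerant pattern (hypothetical and untested): anchor 
on the pieces that survive localization, i.e. the "Microsoft (R) C/C++" 
prefix, the major version, and the x64 marker, rather than the English words 
"Optimizing Compiler Version".

{code:java}
if(MSVC)
  set(COMPILER_FAMILY "msvc")
  # The Korean banner reads "Microsoft (R) C/C++ 최적화 컴파일러 버전
  # 19.00.24215.1(x64)", so skip the localized words between the prefix
  # and the version number.
  if("${COMPILER_VERSION_FULL}" MATCHES
     ".*Microsoft ?\\(R\\) C/C\\+\\+.* 19\\..*x64")
    # ... existing x64 handling ...
  endif()
endif()
{code}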



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-14 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-7154.
-
Resolution: Cannot Reproduce

> [C++] Build error when building tests but not with snappy
> -
>
> Key: ARROW-7154
> URL: https://issues.apache.org/jira/browse/ARROW-7154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Since the docker-compose PR landed, I am having build errors like:
> {code:java}
> [361/376] Linking CXX executable debug/arrow-python-test
> FAILED: debug/arrow-python-test
> : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
> /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
> -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
> -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
> -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
> -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
> -msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro 
> -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
> src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
> debug/arrow-python-test  
> -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
>  debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
> debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread 
> -lpthread -ldl  -lutil -lrt -ldl 
> /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
> /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
> /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
> not found (try using -rpath or -rpath-link)
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
> found (try using -rpath or -rpath-link)
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  debug/libarrow.so.100.0.0: undefined reference to 
> `boost::system::detail::generic_category_ncx()'
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  debug/libarrow.so.100.0.0: undefined reference to 
> `boost::filesystem::path::operator/=(boost::filesystem::path const&)'
> collect2: error: ld returned 1 exit status
> {code}
> which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed 
> by debug/libarrow.so.100.0.0, not found" (although the library is certainly present).
> The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set 
> to OFF, it works fine.
> It also seems to be related to this specific change in the docker compose PR:
> {code:java}
> diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
> index c80ac3310..3b3c9eb8f 100644
> --- a/cpp/CMakeLists.txt
> +++ b/cpp/CMakeLists.txt
> @@ -266,15 +266,6 @@ endif(UNIX)
>  # Set up various options
>  #
> -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
> -  # Currently the compression tests require at least these libraries; bz2 and
> -  # zstd are optional. See ARROW-3984
> -  set(ARROW_WITH_BROTLI ON)
> -  set(ARROW_WITH_LZ4 ON)
> -  set(ARROW_WITH_SNAPPY ON)
> -  set(ARROW_WITH_ZLIB ON)
> -endif()
> -
>  if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
>set(ARROW_JSON ON)
>  endif()
> {code}
> If I add that back, the build works.
> With only {{set(ARROW_WITH_BROTLI ON)}}, it still fails.
>  With only {{set(ARROW_WITH_LZ4 ON)}}, it also fails, but with an error about 
> liblz4 instead of libboost (even though liblz4 is actually present).
>  With only {{set(ARROW_WITH_SNAPPY ON)}}, it works.
>  With only {{set(ARROW_WITH_ZLIB ON)}}, it also fails, but with an error 
> about libz.so.1 not being found.
> With both {{set(ARROW_WITH_SNAPPY ON)}} and {{set(ARROW_WITH_ZLIB ON)}}, it 
> also works. So it seems that the absence of snappy causes the others to fail.
> In the recommended build settings in the development docs 
> ([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]),
>  the compression libraries are enabled. But I was still building without them 
> (stemming from the time they were enabled by 

[jira] [Commented] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-14 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974173#comment-16974173
 ] 

Antoine Pitrou commented on ARROW-7154:
---

Ok, closing.

> [C++] Build error when building tests but not with snappy
> -
>
> Key: ARROW-7154
> URL: https://issues.apache.org/jira/browse/ARROW-7154
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Since the docker-compose PR landed, I am having build errors like:
> {code:java}
> [361/376] Linking CXX executable debug/arrow-python-test
> FAILED: debug/arrow-python-test
> : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
> /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
> -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
> -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
> -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
> -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
> -msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro 
> -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
> src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
> debug/arrow-python-test  
> -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
>  debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
> debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
> /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread 
> -lpthread -ldl  -lutil -lrt -ldl 
> /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
> /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
> jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
> /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
> not found (try using -rpath or -rpath-link)
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
> found (try using -rpath or -rpath-link)
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  debug/libarrow.so.100.0.0: undefined reference to 
> `boost::system::detail::generic_category_ncx()'
> /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
>  debug/libarrow.so.100.0.0: undefined reference to 
> `boost::filesystem::path::operator/=(boost::filesystem::path const&)'
> collect2: error: ld returned 1 exit status
> {code}
> which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed 
> by debug/libarrow.so.100.0.0, not found" (although the library is certainly present).
> The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set 
> to OFF, it works fine.
> It also seems to be related to this specific change in the docker compose PR:
> {code:java}
> diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
> index c80ac3310..3b3c9eb8f 100644
> --- a/cpp/CMakeLists.txt
> +++ b/cpp/CMakeLists.txt
> @@ -266,15 +266,6 @@ endif(UNIX)
>  # Set up various options
>  #
> -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
> -  # Currently the compression tests require at least these libraries; bz2 and
> -  # zstd are optional. See ARROW-3984
> -  set(ARROW_WITH_BROTLI ON)
> -  set(ARROW_WITH_LZ4 ON)
> -  set(ARROW_WITH_SNAPPY ON)
> -  set(ARROW_WITH_ZLIB ON)
> -endif()
> -
>  if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
>set(ARROW_JSON ON)
>  endif()
> {code}
> If I add that back, the build works.
> With only {{set(ARROW_WITH_BROTLI ON)}}, it still fails.
>  With only {{set(ARROW_WITH_LZ4 ON)}}, it also fails, but with an error about 
> liblz4 instead of libboost (even though liblz4 is actually present).
>  With only {{set(ARROW_WITH_SNAPPY ON)}}, it works.
>  With only {{set(ARROW_WITH_ZLIB ON)}}, it also fails, but with an error 
> about libz.so.1 not being found.
> With both {{set(ARROW_WITH_SNAPPY ON)}} and {{set(ARROW_WITH_ZLIB ON)}}, it 
> also works. So it seems that the absence of snappy causes the others to fail.
> In the recommended build settings in the development docs 
> ([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]),
>  the compression libraries are enabled. But I was still building without them 
> (stemming from the time they 

[jira] [Updated] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake

2019-11-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7162:
--
Labels: pull-request-available  (was: )

> [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
> ---
>
> Key: ARROW-7162
> URL: https://issues.apache.org/jira/browse/ARROW-7162
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Developer Tools
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> For clang we currently disable a lot of warnings explicitly. This dates back 
> to when we enabled {{-Weverything}}. We should probably remove most or all of 
> these flags now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7163) [Doc] Fix double-and typos

2019-11-14 Thread Brian Wignall (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974169#comment-16974169
 ] 

Brian Wignall commented on ARROW-7163:
--

Thank you for creating the Jira issue for me, Neal.

> [Doc] Fix double-and typos
> --
>
> Key: ARROW-7163
> URL: https://issues.apache.org/jira/browse/ARROW-7163
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

