[jira] [Comment Edited] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974875#comment-16974875 ] Micah Kornfield edited comment on ARROW-6112 at 11/15/19 7:19 AM: -- based on discussion on mailing list, there was a request for a rebase on the original PR that accommodated all vector APIs. I haven't and probably won't have time to do this. It was mentioned that just redoing ArrowBuf to use 64-bit address space might be more palatable. So I'm going to focus on making that happen. This will make it possible to support LargeBinary and LargeString (with the limitation that string length will still probably be practically limited to 2GB for most APIs). However, for LargeArray child arrays will still be limited to 2 billion entries so this would be of limited utility. was (Author: emkornfi...@gmail.com): based on discussion on mailing list, there was a request for a rebase on the original PR that accommodated all vector APIs. I haven't and probably won't have time to do this. It was mentioned that just redoing ArrowBuf to use 64-bit address space might be more palatable. So I'm going to focus on making that happen. > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated ARROW-6112: --- Description: The arrow spec allows for 64 bit address range for buffers (and arrays) we should support this at the API level in Java even if the current Netty backing buffers don't support it. See comment below. This work item will focus on allowing 64-bit addressing in buffers. was:The arrow spec allows for 64 bit address range for buffers (and arrays) we should support this at the API level in Java even if the current Netty backing buffers don't support it. > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. > > See comment below. This work item will focus on allowing 64-bit addressing > in buffers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6112) [Java] Update APIs to support 64-bit address space
[ https://issues.apache.org/jira/browse/ARROW-6112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974875#comment-16974875 ] Micah Kornfield commented on ARROW-6112: based on discussion on mailing list, there was a request for a rebase on the original PR that accommodated all vector APIs. I haven't and probably won't have time to do this. It was mentioned that just redoing ArrowBuf to use 64-bit address space might be more palatable. So I'm going to focus on making that happen. > [Java] Update APIs to support 64-bit address space > -- > > Key: ARROW-6112 > URL: https://issues.apache.org/jira/browse/ARROW-6112 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > The arrow spec allows for 64 bit address range for buffers (and arrays) we > should support this at the API level in Java even if the current Netty > backing buffers don't support it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
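The "practically limited to 2GB" caveat in the comments above falls out of the signed 32-bit value offsets used by Arrow's default variable-width types; LargeBinary/LargeString widen the offsets to 64 bits. A minimal stdlib Python sketch of that limit (the helper name is invented for illustration; this is not an Arrow API):

```python
import struct

# Arrow's default Binary/Utf8 layout stores value offsets as signed 32-bit
# integers, so one array's concatenated value bytes are capped at 2^31 - 1.
# Large* types (and a 64-bit ArrowBuf address space) lift that cap by using
# signed 64-bit offsets instead.
INT32_MAX = 2**31 - 1

def fits_default_offsets(total_bytes: int) -> bool:
    """Return True if a buffer of this size is addressable with i32 offsets."""
    return total_bytes <= INT32_MAX

print(fits_default_offsets(2 * 1024**3 - 1))   # just under 2 GiB -> True
print(fits_default_offsets(2 * 1024**3))       # 2 GiB -> False

# Packing an offset past the cap fails outright, which is the overflow a
# 64-bit address space avoids:
try:
    struct.pack("<i", INT32_MAX + 1)
except struct.error as e:
    print("overflow:", e)
```

The sketch only illustrates the arithmetic; actual Java vector APIs also take `int` lengths in many places, which is why string lengths stay practically bounded even with 64-bit buffers.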
[jira] [Commented] (ARROW-6887) [Java] Create prose documentation for using ValueVectors
[ https://issues.apache.org/jira/browse/ARROW-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974862#comment-16974862 ] Micah Kornfield commented on ARROW-6887: i believe we mostly only update documentation after release. > [Java] Create prose documentation for using ValueVectors > > > Key: ARROW-6887 > URL: https://issues.apache.org/jira/browse/ARROW-6887 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 40m > Remaining Estimate: 0h > > We should create documentation (in restructured text) for the library that > demonstrates: > 1. Basic construction of ValueVectors. Highlighting: > * ValueVector lifecycle > * Reading by rows using Readers (mentioning that it is not as efficient > as direct access). > * Populating with Writers > 2. Reading and writing IPC stream format and file formats. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-1175) [Java] Implement/test dictionary-encoded subfields
[ https://issues.apache.org/jira/browse/ARROW-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu resolved ARROW-1175. --- Resolution: Fixed > [Java] Implement/test dictionary-encoded subfields > -- > > Key: ARROW-1175 > URL: https://issues.apache.org/jira/browse/ARROW-1175 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Wes McKinney >Assignee: Ji Liu >Priority: Major > Fix For: 1.0.0 > > > We do not have any tests about types like: > {code} > List > {code} > cc [~julienledem] [~elahrvivaz] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6600) [Java] Implement dictionary-encoded subfields for Union type
[ https://issues.apache.org/jira/browse/ARROW-6600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu closed ARROW-6600. - Resolution: Later > [Java] Implement dictionary-encoded subfields for Union type > > > Key: ARROW-6600 > URL: https://issues.apache.org/jira/browse/ARROW-6600 > Project: Apache Arrow > Issue Type: Sub-task > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > Implement dictionary-encoded subfields for {{Union}} type. Each child vector > could be encodable or not. > > Meanwhile extra common logic into {{DictionaryEncoder}} as well as refactor > List subfield encoding to keep consistent with {{Struct/Union}} type. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6019) [Java] Port Jdbc and Avro adapter to new directory
[ https://issues.apache.org/jira/browse/ARROW-6019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu closed ARROW-6019. - Resolution: Won't Do > [Java] Port Jdbc and Avro adapter to new directory > --- > > Key: ARROW-6019 > URL: https://issues.apache.org/jira/browse/ARROW-6019 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > > As discussed in mail list, adapters are different from native reader. > This issue is used to track these issues: > i. create new “contrib” directory and move Jdbc/Avro adapter to it. > ii. provide more description. > iii. change orc readers structure to “converter" > cc [~emkornfi...@gmail.com] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7175) [Website] Add a security page to track when vulnerabilities are patched
Micah Kornfield created ARROW-7175: -- Summary: [Website] Add a security page to track when vulnerabilities are patched Key: ARROW-7175 URL: https://issues.apache.org/jira/browse/ARROW-7175 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Micah Kornfield we might also want to give a brief tutorial on safely using the C++ library (e.g. pointers to validation methods). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6887) [Java] Create prose documentation for using ValueVectors
[ https://issues.apache.org/jira/browse/ARROW-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974842#comment-16974842 ] Ji Liu commented on ARROW-6887: --- Seems the web content([http://arrow.apache.org/docs/]) is not updated yet, should we do something to make this docs work? [~wesm] [~emkornfi...@gmail.com] > [Java] Create prose documentation for using ValueVectors > > > Key: ARROW-6887 > URL: https://issues.apache.org/jira/browse/ARROW-6887 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 40m > Remaining Estimate: 0h > > We should create documentation (in restructured text) for the library that > demonstrates: > 1. Basic construction of ValueVectors. Highlighting: > * ValueVector lifecycle > * Reading by rows using Readers (mentioning that it is not as efficient > as direct access). > * Populating with Writers > 2. Reading and writing IPC stream format and file formats. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7150) [Python] Explain parquet file size growth
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-7150. Resolution: Not A Problem > [Python] Explain parquet file size growth > - > > Key: ARROW-7150 > URL: https://issues.apache.org/jira/browse/ARROW-7150 > Project: Apache Arrow > Issue Type: Task > Components: Python >Affects Versions: 0.15.1 > Environment: Mac OS X >Reporter: Bogdan Klichuk >Priority: Major > Attachments: 820.parquet > > > Having columnar storage format in mind, with gzip compression enabled, I > can't make sense of how parquet file size is growing in my specific example. > So far without sharing a dataset (would need to create a mock one to share). > {code:java} > > # 1. read 820 rows from a parquet file > > df.read_parquet('820.parquet') > > # size of 820.parquet is 528K > > len(df) > 820 > > # 2. write 8200 rows to a parquet file > > df_big = pandas.concat([df] * 10).reset_index(drop=True) > > len(df_big) > 8200 > > df_big.to_parquet('8200.parquet', compression='gzip') > > # size of 800.parquet is 33M. Why is it 60 times bigger? > {code} > > Compression works better on bigger files. How come 10x1 increase with > repeated data resulted in 60x growth of file? Insane imo. > > Working on a periodic job that concats smaller files into bigger ones and > doubting now whether I need this. > > I attached 820.parquet to try out -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974827#comment-16974827 ] Micah Kornfield commented on ARROW-7150: [~klichukb] I don't think either Parquet or Avro is going to give you great performance if you are simply sticking JSON strings in as data. If you want smaller file sizes, two columns for parquet, one with field name and one with field value (boolean) would do a lot better. Unfortunately, the Arrow library doesn't support writing nested data yet (otherwise list) would be preferred. I opened https://issues.apache.org/jira/browse/ARROW-7174 which if implemented and there is a lot of duplication in strings might still get useful compression. For now I think this is working as intended. > [Python] Explain parquet file size growth > - > > Key: ARROW-7150 > URL: https://issues.apache.org/jira/browse/ARROW-7150 > Project: Apache Arrow > Issue Type: Task > Components: Python >Affects Versions: 0.15.1 > Environment: Mac OS X >Reporter: Bogdan Klichuk >Priority: Major > Attachments: 820.parquet > > > Having columnar storage format in mind, with gzip compression enabled, I > can't make sense of how parquet file size is growing in my specific example. > So far without sharing a dataset (would need to create a mock one to share). > {code:java} > > # 1. read 820 rows from a parquet file > > df.read_parquet('820.parquet') > > # size of 820.parquet is 528K > > len(df) > 820 > > # 2. write 8200 rows to a parquet file > > df_big = pandas.concat([df] * 10).reset_index(drop=True) > > len(df_big) > 8200 > > df_big.to_parquet('8200.parquet', compression='gzip') > > # size of 800.parquet is 33M. Why is it 60 times bigger? > {code} > > Compression works better on bigger files. How come 10x1 increase with > repeated data resulted in 60x growth of file? Insane imo. > > Working on a periodic job that concats smaller files into bigger ones and > doubting now whether I need this. > > I attached 820.parquet to try out -- This message was sent by Atlassian Jira (v8.3.4#803005)
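The suggestion in the comment above hinges on dictionary encoding: Parquet stores each distinct value once in a dictionary page (1 MB by default, the limit ARROW-7174 proposes exposing) plus a small per-row index, and falls back to plain encoding once that page fills up. A rough stdlib sketch of the effect, with made-up data and a deliberately simplified encoder (this is not Parquet's actual format):

```python
import gzip
import json

# Made-up rows that mimic sticking the same JSON blob into a string column:
# many rows, very few distinct values.
row = json.dumps({"feature_a": True, "feature_b": False, "feature_c": True})
rows = [row] * 8200

# Plain encoding: every row's bytes are written out in full.
plain = "\n".join(rows).encode()

# Dictionary encoding, roughly what Parquet does until its dictionary page
# (1 MB by default) fills up: each distinct value stored once, plus a small
# per-row index into the dictionary.
distinct = sorted(set(rows))
lookup = {value: i for i, value in enumerate(distinct)}
dictionary = "\n".join(distinct).encode()
indices = bytes(lookup[r] for r in rows)  # one byte per row is enough here

encoded = dictionary + indices
print(len(plain), len(encoded))  # the dictionary form is far smaller
print(len(gzip.compress(encoded)) < len(gzip.compress(plain)))
```

When the real dictionary page overflows (e.g. long, mostly-unique JSON strings), the writer degrades to the "plain" case above, which is one plausible reason a 10x larger frame produced a 60x larger file.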
[jira] [Created] (ARROW-7174) [Python] Expose dictionary size parameter in python.
Micah Kornfield created ARROW-7174: -- Summary: [Python] Expose dictionary size parameter in python. Key: ARROW-7174 URL: https://issues.apache.org/jira/browse/ARROW-7174 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Micah Kornfield In some cases it might be useful to have dictionaries larger then the current default 1MB. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7174) [Python] Expose parquet dictionary size write parameter in python.
[ https://issues.apache.org/jira/browse/ARROW-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield updated ARROW-7174: --- Summary: [Python] Expose parquet dictionary size write parameter in python. (was: [Python] Expose dictionary size parameter in python.) > [Python] Expose parquet dictionary size write parameter in python. > -- > > Key: ARROW-7174 > URL: https://issues.apache.org/jira/browse/ARROW-7174 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Micah Kornfield >Priority: Major > > In some cases it might be useful to have dictionaries larger then the current > default 1MB. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6820. - Resolution: Fixed Issue resolved by pull request 5821 [https://github.com/apache/arrow/pull/5821] > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-6820: --- Assignee: Bryan Cutler > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Assignee: Bryan Cutler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7173) Add test to verify Map field names can be arbitrary
Bryan Cutler created ARROW-7173: --- Summary: Add test to verify Map field names can be arbitrary Key: ARROW-7173 URL: https://issues.apache.org/jira/browse/ARROW-7173 Project: Apache Arrow Issue Type: Test Components: Integration Reporter: Bryan Cutler A Map has child fields and the format spec only recommends that they be named "entries", "key", and "value" but could be named anything. Currently, integration tests for Map arrays verify the exchanged schema is equal, so the child fields are always named the same. There should be tests that use different names to verify implementations can accept this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
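One way to make such a test robust is to compare Map types structurally, ignoring the child-field names the spec only recommends. A hypothetical, Arrow-free Python sketch (the tuple encoding of a map type here is invented for illustration):

```python
# Sketch: treat Map child-field names ("entries", "key", "value") as
# cosmetic and compare only the key/value types, which is the tolerance
# the integration test above wants implementations to demonstrate.
def map_types_compatible(a, b):
    """a, b: (entries_name, key_name, key_type, value_name, value_type)."""
    _, _, a_key_type, _, a_value_type = a
    _, _, b_key_type, _, b_value_type = b
    return (a_key_type, a_value_type) == (b_key_type, b_value_type)

spec_default = ("entries", "key", "string", "value", "int32")
renamed      = ("pairs",   "keys", "string", "items", "int32")
print(map_types_compatible(spec_default, renamed))  # names differ, types match
```

A schema-equality check that compares child names verbatim would reject `renamed` even though both describe the same map layout.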
[jira] [Commented] (ARROW-6361) [Java] sbt docker publish fails due to Arrow dependecies
[ https://issues.apache.org/jira/browse/ARROW-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974823#comment-16974823 ] Micah Kornfield commented on ARROW-6361: [~tampler] I think in the short term updating documentation for this would be helpful (would you like to open a PR?). i'm not an expert on SBT or Maven but is there some way to potentially fix this to avoid the workaround? > [Java] sbt docker publish fails due to Arrow dependecies > > > Key: ARROW-6361 > URL: https://issues.apache.org/jira/browse/ARROW-6361 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 0.14.1 >Reporter: Boris V.Kuznetsov >Priority: Major > Attachments: tree.txt > > > Hello guys > I'm using Arrow in my Scala project and included Maven deps in sbt as > required. > However, when I try to publish a Docker container with sbt 'docker:publish', > I get the following error: > [error] 1 error was encountered during merge > [error] java.lang.RuntimeException: deduplicate: different file contents > found in the following: > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-vector/0.14.1/arrow-vector-0.14.1.jar:git.properties > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-format/0.14.1/arrow-format-0.14.1.jar:git.properties > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-memory/0.14.1/arrow-memory-0.14.1.jar:git.properties > My project is [here|https://github.com/Clover-Group/tsp/tree/kafka]. > You may check project dependency tree attached. > How do I fix that? > Thank you -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1099) [C++] Add support for PFOR integer compression
[ https://issues.apache.org/jira/browse/ARROW-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974818#comment-16974818 ] Micah Kornfield commented on ARROW-1099: I would rather discuss this as a possible follow-up to the compression proposal I have out, so let's close for now. > [C++] Add support for PFOR integer compression > -- > > Key: ARROW-1099 > URL: https://issues.apache.org/jira/browse/ARROW-1099 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > https://github.com/lemire/FastPFor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-1099) [C++] Add support for PFOR integer compression
[ https://issues.apache.org/jira/browse/ARROW-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-1099. Resolution: Later > [C++] Add support for PFOR integer compression > -- > > Key: ARROW-1099 > URL: https://issues.apache.org/jira/browse/ARROW-1099 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > https://github.com/lemire/FastPFor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6818) [Doc] Format docs confusing
[ https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-6818: -- Assignee: Micah Kornfield > [Doc] Format docs confusing > --- > > Key: ARROW-6818 > URL: https://issues.apache.org/jira/browse/ARROW-6818 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Format >Reporter: Antoine Pitrou >Assignee: Micah Kornfield >Priority: Major > > I find there are several issues in the format docs. > 1) there is a claimed distinction between "logical types" and "physical > types", but the "physical types" actually lists logical types such as Map > 2) the "logical types" document doesn't actually list logical types, it just > sends to the flatbuffers file. One shouldn't have to read a flatbuffers file > to understand the Arrow format. > 3) some terminology seems unusual, such as "relative type" > 4) why is there a link to the Apache Drill docs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6818) [Doc] Format docs confusing
[ https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974817#comment-16974817 ] Micah Kornfield commented on ARROW-6818: i'll see if i can address these. > [Doc] Format docs confusing > --- > > Key: ARROW-6818 > URL: https://issues.apache.org/jira/browse/ARROW-6818 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Format >Reporter: Antoine Pitrou >Assignee: Micah Kornfield >Priority: Major > > I find there are several issues in the format docs. > 1) there is a claimed distinction between "logical types" and "physical > types", but the "physical types" actually lists logical types such as Map > 2) the "logical types" document doesn't actually list logical types, it just > sends to the flatbuffers file. One shouldn't have to read a flatbuffers file > to understand the Arrow format. > 3) some terminology seems unusual, such as "relative type" > 4) why is there a link to the Apache Drill docs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6930) [Java] Create utility class for populating vector values used for test purpose only
[ https://issues.apache.org/jira/browse/ARROW-6930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6930. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5693 [https://github.com/apache/arrow/pull/5693] > [Java] Create utility class for populating vector values used for test > purpose only > --- > > Key: ARROW-6930 > URL: https://issues.apache.org/jira/browse/ARROW-6930 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 11h 20m > Remaining Estimate: 0h > > There is a lot of verbosity in the construction of Arrays for testing > purposes (multiple lines of setSafe(...) or set(...)). > We should start adding a utility class to make test setup clearer and more > concise, note this class should be located in arrow-vector test package and > could be used in other module’s testing by adding dependency: > {{<dependency>}} > {{<groupId>org.apache.arrow</groupId>}} > {{<artifactId>arrow-vector</artifactId>}} > {{<version>${project.version}</version>}} > {{<classifier>tests</classifier>}} > {{<type>test-jar</type>}} > {{<scope>test</scope>}} > {{</dependency>}} > Usage would be something like: > {quote}try (IntVector vector = new IntVector("vector", allocator)) { > ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5); > output = doSomethingWith(input); > assertThat(output).isEqualTo(expected); > } > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6930) [Java] Create utility class for populating vector values used for test purpose only
[ https://issues.apache.org/jira/browse/ARROW-6930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu updated ARROW-6930: -- Description: There is a lot of verbosity in the construction of Arrays for testing purposes (multiple lines of setSafe(...) or set(...)). We should start adding a utility class to make test setup clearer and more concise, note this class should be located in arrow-vector test package and could be used in other module’s testing by adding dependency: {{<dependency>}} {{<groupId>org.apache.arrow</groupId>}} {{<artifactId>arrow-vector</artifactId>}} {{<version>${project.version}</version>}} {{<classifier>tests</classifier>}} {{<type>test-jar</type>}} {{<scope>test</scope>}} {{</dependency>}} Usage would be something like: {quote}try (IntVector vector = new IntVector("vector", allocator)) { ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5); output = doSomethingWith(input); assertThat(output).isEqualTo(expected); } {quote} was: There is a lot of verbosity in the construction of Arrays for testing purposes (multiple lines of setSafe(...) or set(...)). We should start adding some static factory methods to make test setup clearer and more concise. A strawman proposal for BigIntVector might look like: static BigIntVector create(String name, BufferAllocator allocator, Long... values). Usage would be something like: try (BigIntVector input = BigIntVector.create("sample_data", allocator, 1235L, null, 456L), BigIntVector expected = BigIntVector.create("sample_data", allocator, 1L, null, 0L)) { output = doSomethingWith(input); assertThat(output).isEqualTo(expected); } > [Java] Create utility class for populating vector values used for test > purpose only > --- > > Key: ARROW-6930 > URL: https://issues.apache.org/jira/browse/ARROW-6930 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Time Spent: 11h 10m > Remaining Estimate: 0h > > There is a lot of verbosity in the construction of Arrays for testing > purposes (multiple lines of setSafe(...) or set(...)). > We should start adding a utility class to make test setup clearer and more > concise, note this class should be located in arrow-vector test package and > could be used in other module’s testing by adding dependency: > {{<dependency>}} > {{<groupId>org.apache.arrow</groupId>}} > {{<artifactId>arrow-vector</artifactId>}} > {{<version>${project.version}</version>}} > {{<classifier>tests</classifier>}} > {{<type>test-jar</type>}} > {{<scope>test</scope>}} > {{</dependency>}} > Usage would be something like: > {quote}try (IntVector vector = new IntVector("vector", allocator)) { > ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5); > output = doSomethingWith(input); > assertThat(output).isEqualTo(expected); > } > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6997) [Packaging] Add support for RHEL
[ https://issues.apache.org/jira/browse/ARROW-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-6997. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5823 [https://github.com/apache/arrow/pull/5823] > [Packaging] Add support for RHEL > > > Key: ARROW-6997 > URL: https://issues.apache.org/jira/browse/ARROW-6997 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > We need symbolic links to {{${VERSION}Server}} from {{${VERSION}}} such as > {{7Server}} from {{7}}. (Is it available on BinTray?) > We also need to update install information. We can't install {{epel-release}} > by {{yum install -y epel-release}}. We need to specify URL explicitly: {{yum > install > https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See > https://fedoraproject.org/wiki/EPEL for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7157) [R] Add validation, helpful error message to Object$new()
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7157. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5839 [https://github.com/apache/arrow/pull/5839] > [R] Add validation, helpful error message to Object$new() > - > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Assignee: Neal Richardson >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7170) [C++] Bundled ORC fails linking
[ https://issues.apache.org/jira/browse/ARROW-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-7170: --- Assignee: Antoine Pitrou > [C++] Bundled ORC fails linking > --- > > Key: ARROW-7170 > URL: https://issues.apache.org/jira/browse/ARROW-7170 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > This shows up when building the tests as well: > {code} > [1/2] Linking CXX executable debug/orc-adapter-test > FAILED: debug/orc-adapter-test > : && /usr/bin/ccache /usr/bin/clang++-7 -Qunused-arguments > -fcolor-diagnostics -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation > -Wno-unused-parameter -Wno-unknown-warning-option -Werror > -Wno-unknown-warning-option -msse4.2 -maltivec -D_GLIBCXX_USE_CXX11_ABI=1 > -D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g -rdynamic > src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o -o > debug/orc-adapter-test > -Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib > /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so > /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl > debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 > orc_ep-install/lib/liborc.a > /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl > double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a > /home/antoine/miniconda3/envs/pyarrow/lib/libssl.so > /home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so > /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a > /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a > /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a > /home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so > 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm > -lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so > jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a > mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread > -lrt -Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && : > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284: > error: undefined reference to 'deflateInit2_' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232: > error: undefined reference to 'deflateReset' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254: > error: undefined reference to 'deflate' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291: > error: undefined reference to 'deflateEnd' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405: > error: undefined reference to 'inflateInit2_' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430: > error: undefined reference to 'inflateEnd' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471: > error: undefined reference to 'inflateReset' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477: > error: undefined reference to 'inflate' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820: > error: undefined reference to 'snappy::GetUncompressedLength(char const*, > unsigned long, unsigned long*)' > 
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828: > error: undefined reference to 'snappy::RawUncompress(char const*, unsigned > long, char*)' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894: > error: undefined reference to 'LZ4_decompress_safe' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7158) [C++][Visual Studio]Build config Error on non English Version visual studio.
[ https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974624#comment-16974624 ] Kouhei Sutou commented on ARROW-7158: - We should use {{CMAKE_CXX_COMPILER_ID}} and {{CMAKE_CXX_COMPILER_VERSION}} instead of parsing the version string from the compiler output. * https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_ID.html#variable:CMAKE_%3CLANG%3E_COMPILER_ID * https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_COMPILER_VERSION.html > [C++][Visual Studio]Build config Error on non English Version visual studio. > > > Key: ARROW-7158 > URL: https://issues.apache.org/jira/browse/ARROW-7158 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.1 >Reporter: Yiun Seungryong >Assignee: Kouhei Sutou >Priority: Minor > > * Build Config Error on Non English OS > * always show > {code:java} > Not supported MSVC compiler {code} > * > [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44] > There is a bug in the code below. > {code:java} > if(MSVC) > set(COMPILER_FAMILY "msvc") > if("${COMPILER_VERSION_FULL}" MATCHES > ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code} > * In my compiler the version display contains Korean. > {code:java} > Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code} > * Regular expression seems to need to be changed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7158) [C++][Visual Studio]Build config Error on non English Version visual studio.
[ https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-7158: --- Assignee: Kouhei Sutou > [C++][Visual Studio]Build config Error on non English Version visual studio. > > > Key: ARROW-7158 > URL: https://issues.apache.org/jira/browse/ARROW-7158 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.1 >Reporter: Yiun Seungryong >Assignee: Kouhei Sutou >Priority: Minor > > * Build Config Error on Non English OS > * always show > {code:java} > Not supported MSVC compiler {code} > * > [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44] > There is a bug in the code below. > {code:java} > if(MSVC) > set(COMPILER_FAMILY "msvc") > if("${COMPILER_VERSION_FULL}" MATCHES > ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code} > * In my compiler the version display contains Korean. > {code:java} > Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code} > * Regular expression seems to need to be changed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString
[ https://issues.apache.org/jira/browse/ARROW-7172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7172: -- Labels: pull-request-available (was: ) > [C++][Dataset] Improve format of Expression::ToString > - > > Key: ARROW-7172 > URL: https://issues.apache.org/jira/browse/ARROW-7172 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > > Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read > {{"b"_ > int32(3)}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7157) [R] Add validation, helpful error message to Object$new()
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7157: -- Labels: pull-request-available (was: ) > [R] Add validation, helpful error message to Object$new() > - > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Assignee: Neal Richardson >Priority: Blocker > Labels: pull-request-available > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reopened ARROW-7157: Assignee: Neal Richardson > [R] RecordBatchFileReader - Crashes RStudio > --- > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Assignee: Neal Richardson >Priority: Blocker > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7157) [R] Add validation, helpful error message to Object$new()
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7157: --- Summary: [R] Add validation, helpful error message to Object$new() (was: [R] RecordBatchFileReader - Crashes RStudio) > [R] Add validation, helpful error message to Object$new() > - > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Assignee: Neal Richardson >Priority: Blocker > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-851) C++/Python: Check Boost/Arrow C++ABI for consistency
[ https://issues.apache.org/jira/browse/ARROW-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Korn resolved ARROW-851. Resolution: Fixed conda-forge moved to the new CXX ABI and thus this is no longer relevant. > C++/Python: Check Boost/Arrow C++ABI for consistency > > > Key: ARROW-851 > URL: https://issues.apache.org/jira/browse/ARROW-851 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Uwe Korn >Priority: Major > > When building with dependencies from conda-forge on a newer system with GCC, > the C++ ABI versions can differ. We need to ensure that the versions match > between Boost, arrow-cpp and pyarrow in our CMake scripts. > Depending on this, we may need to pass {{-D_GLIBCXX_USE_CXX11_ABI=0}} to > {{CMAKE_CXX_FLAGS}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6965) [C++][Dataset] Optionally expose partition keys as materialized columns
[ https://issues.apache.org/jira/browse/ARROW-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman reassigned ARROW-6965: --- Assignee: Ben Kietzman > [C++][Dataset] Optionally expose partition keys as materialized columns > --- > > Key: ARROW-6965 > URL: https://issues.apache.org/jira/browse/ARROW-6965 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Dataset >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset > > This would be exposed in the DataSourceDiscovery as an option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7163) [Doc] Fix double-and typos
[ https://issues.apache.org/jira/browse/ARROW-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-7163: - Assignee: Brian Wignall > [Doc] Fix double-and typos > -- > > Key: ARROW-7163 > URL: https://issues.apache.org/jira/browse/ARROW-7163 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Neal Richardson >Assignee: Brian Wignall >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7069) [C++][Dataset] Replace ConstantPartitionScheme with PrefixDictionaryPartitionScheme
[ https://issues.apache.org/jira/browse/ARROW-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7069: -- Labels: pull-request-available (was: ) > [C++][Dataset] Replace ConstantPartitionScheme with > PrefixDictionaryPartitionScheme > --- > > Key: ARROW-7069 > URL: https://issues.apache.org/jira/browse/ARROW-7069 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Affects Versions: 0.15.1 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > > ConstantPartitionScheme is not very useful, it'd be better to provide a > dictionary of prefixes which maps to provided partition expressions -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6636) [C++] Do not build C++ command line utilities by default
[ https://issues.apache.org/jira/browse/ARROW-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-6636. --- Resolution: Fixed Issue resolved by pull request 5830 [https://github.com/apache/arrow/pull/5830] > [C++] Do not build C++ command line utilities by default > > > Key: ARROW-6636 > URL: https://issues.apache.org/jira/browse/ARROW-6636 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This means to change {{ARROW_BUILD_UTILITIES}} to be off by default. These > are mostly used for integration testing, so building unit or integration > tests should toggle this on automatically. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6636) [C++] Do not build C++ command line utilities by default
[ https://issues.apache.org/jira/browse/ARROW-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6636: - Assignee: Antoine Pitrou > [C++] Do not build C++ command line utilities by default > > > Key: ARROW-6636 > URL: https://issues.apache.org/jira/browse/ARROW-6636 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This means to change {{ARROW_BUILD_UTILITIES}} to be off by default. These > are mostly used for integration testing, so building unit or integration > tests should toggle this on automatically. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6635) [C++] Do not require glog for default build
[ https://issues.apache.org/jira/browse/ARROW-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-6635. --- Resolution: Fixed Issue resolved by pull request 5829 [https://github.com/apache/arrow/pull/5829] > [C++] Do not require glog for default build > --- > > Key: ARROW-6635 > URL: https://issues.apache.org/jira/browse/ARROW-6635 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > We should change the default for {{ARROW_USE_GLOG}} to be off -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6635) [C++] Do not require glog for default build
[ https://issues.apache.org/jira/browse/ARROW-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6635: - Assignee: Antoine Pitrou > [C++] Do not require glog for default build > --- > > Key: ARROW-6635 > URL: https://issues.apache.org/jira/browse/ARROW-6635 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > We should change the default for {{ARROW_USE_GLOG}} to be off -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974511#comment-16974511 ] Joris Van den Bossche commented on ARROW-7168: -- [~buhrmann] thanks for the report. When passing a type like that, I agree it should be honoured. Some other observations: also when it's not all-NaN, the specified type gets ignored:
{code}
In [19]: cat = pd.Categorical(['a', 'b'])

In [20]: typ = pa.dictionary(index_type=pa.int8(), value_type=pa.int64(), ordered=False)

In [21]: pa.array(cat, type=typ)
Out[21]:
-- dictionary:
[
  "a",
  "b"
]
-- indices:
[
  0,
  1
]

In [22]: pa.array(cat, type=typ).type
Out[22]: DictionaryType(dictionary)
{code}
So I suppose it's a more general problem, not specifically related to the all-NaN case (it only appears for you in this case, as otherwise the specified type and the type inferred from the data will probably match). In the example shown above, we should probably raise an error if the specified type is not compatible (string vs int categories). > [Python] pa.array() doesn't respect provided dictionary type with all NaNs > -- > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None, None]).astype('object').astype('category') > typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), > ordered=False) > pa.array(ser, type=typ).type > {code} > results in > {noformat} > >> DictionaryType(dictionary) > {noformat} > which means that one cannot e.g. serialize batches of categoricals if the > possibility of all-NaN batches exists, even when trying to enforce that each > batch has the same schema (because the schema is not respected). > I understand that inferring the type in this case would be difficult, but I'd > imagine that a fully specified type should be respected in this case? > In the meantime, is there a workaround to manually create a dictionary array > of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
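The failure mode in the report above is easier to see when dictionary encoding is reduced to its two parts. Below is a minimal, stdlib-only sketch (the class and names are illustrative, not the pyarrow API): a dictionary-encoded column is an (indices, dictionary) pair, so an all-null batch contains no values from which to infer the dictionary's value type, and keeping a stable schema across batches requires the declared type to be carried explicitly rather than re-inferred.

```python
# Hypothetical sketch (stdlib only; not pyarrow): a dictionary-encoded
# column stores indices into a dictionary of unique values. The declared
# value type is kept as explicit state so an all-null batch does not
# degrade to a "null"-typed dictionary.
class DictColumn:
    def __init__(self, indices, dictionary, value_type):
        self.indices = indices          # None marks a null slot
        self.dictionary = dictionary    # unique values, addressed by index
        self.value_type = value_type    # declared type, kept even when empty

    def to_values(self):
        # Decode back to plain values, preserving nulls.
        return [None if i is None else self.dictionary[i] for i in self.indices]

# A normal batch: the value type could also be inferred from the dictionary.
batch1 = DictColumn([0, None, 1], ["a", "b"], value_type=str)
assert batch1.to_values() == ["a", None, "b"]

# An all-null batch: nothing to infer from, but the schema stays stable
# because the declared value_type is respected.
batch2 = DictColumn([None, None], [], value_type=str)
assert batch2.to_values() == [None, None]
assert batch2.value_type is batch1.value_type  # same schema across batches
```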
[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect specified dictionary type
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7168: - Summary: [Python] pa.array() doesn't respect specified dictionary type (was: [Python] pa.array() doesn't respect provided dictionary type with all NaNs) > [Python] pa.array() doesn't respect specified dictionary type > - > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. > When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None, None]).astype('object').astype('category') > typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), > ordered=False) > pa.array(ser, type=typ).type > {code} > results in > {noformat} > >> DictionaryType(dictionary) > {noformat} > which means that one cannot e.g. serialize batches of categoricals if the > possibility of all-NaN batches exists, even when trying to enforce that each > batch has the same schema (because the schema is not respected). > I understand that inferring the type in this case would be difficult, but I'd > imagine that a fully specified type should be respected in this case? > In the meantime, is there a workaround to manually create a dictionary array > of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7168) [Python] pa.array() doesn't respect provided dictionary type with all NaNs
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-7168: - Summary: [Python] pa.array() doesn't respect provided dictionary type with all NaNs (was: pa.array() doesn't respect provided dictionary type with all NaNs) > [Python] pa.array() doesn't respect provided dictionary type with all NaNs > -- > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. > When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None, None]).astype('object').astype('category') > typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), > ordered=False) > pa.array(ser, type=typ).type > {code} > results in > {noformat} > >> DictionaryType(dictionary) > {noformat} > which means that one cannot e.g. serialize batches of categoricals if the > possibility of all-NaN batches exists, even when trying to enforce that each > batch has the same schema (because the schema is not respected). > I understand that inferring the type in this case would be difficult, but I'd > imagine that a fully specified type should be respected in this case? > In the meantime, is there a workaround to manually create a dictionary array > of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7171) [Ruby] Pass Array for Arrow::Table#filter
[ https://issues.apache.org/jira/browse/ARROW-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7171: -- Labels: pull-request-available (was: ) > [Ruby] Pass Array for Arrow::Table#filter > -- > > Key: ARROW-7171 > URL: https://issues.apache.org/jira/browse/ARROW-7171 > Project: Apache Arrow > Issue Type: New Feature > Components: Ruby >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString
Ben Kietzman created ARROW-7172: --- Summary: [C++][Dataset] Improve format of Expression::ToString Key: ARROW-7172 URL: https://issues.apache.org/jira/browse/ARROW-7172 Project: Apache Arrow Issue Type: Improvement Components: C++, C++ - Dataset Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read {{"b"_ > int32(3)}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7172) [C++][Dataset] Improve format of Expression::ToString
[ https://issues.apache.org/jira/browse/ARROW-7172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974491#comment-16974491 ] Ben Kietzman commented on ARROW-7172: - [~npr] > [C++][Dataset] Improve format of Expression::ToString > - > > Key: ARROW-7172 > URL: https://issues.apache.org/jira/browse/ARROW-7172 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Dataset >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Minor > Fix For: 1.0.0 > > > Instead of {{GREATER(FIELD(b), SCALAR(3))}}, these could just read > {{"b"_ > int32(3)}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
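For illustration, the proposed change to Expression::ToString can be mimicked with a toy expression tree (a Python sketch of the formatting idea only, not the Arrow C++ classes): the same tree prints either in nested-call form or in the terse infix form the issue asks for.

```python
# Toy expression tree with two printers: the current verbose nested-call
# format and the proposed terse infix format. Class names are illustrative.
class Field:
    def __init__(self, name): self.name = name

class Scalar:
    def __init__(self, value): self.value = value

class Greater:
    def __init__(self, left, right): self.left, self.right = left, right

def verbose(e):
    # Current style: GREATER(FIELD(b), SCALAR(3))
    if isinstance(e, Field):  return f"FIELD({e.name})"
    if isinstance(e, Scalar): return f"SCALAR({e.value})"
    return f"GREATER({verbose(e.left)}, {verbose(e.right)})"

def terse(e):
    # Proposed style: "b"_ > int32(3)
    if isinstance(e, Field):  return f'"{e.name}"_'
    if isinstance(e, Scalar): return f"int32({e.value})"
    return f"{terse(e.left)} > {terse(e.right)}"

expr = Greater(Field("b"), Scalar(3))
assert verbose(expr) == "GREATER(FIELD(b), SCALAR(3))"
assert terse(expr) == '"b"_ > int32(3)'
```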
[jira] [Updated] (ARROW-6967) [C++] Add filter expressions for IN, COALESCE, and DROP_NULL
[ https://issues.apache.org/jira/browse/ARROW-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6967: -- Labels: pull-request-available (was: ) > [C++] Add filter expressions for IN, COALESCE, and DROP_NULL > > > Key: ARROW-6967 > URL: https://issues.apache.org/jira/browse/ARROW-6967 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Implement filter expressions for {{IN, COALESCE, DROP_NULL}} > {{IN}} should be backed in TreeEvaluator by the IsIn kernel. The others > should be initially implemented in terms of logical and filter kernels until > a specialized kernel is added for those. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7171) [Ruby] Pass Array for Arrow::Table#filter
Yosuke Shiro created ARROW-7171: --- Summary: [Ruby] Pass Array for Arrow::Table#filter Key: ARROW-7171 URL: https://issues.apache.org/jira/browse/ARROW-7171 Project: Apache Arrow Issue Type: New Feature Components: Ruby Reporter: Yosuke Shiro Assignee: Yosuke Shiro -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2496) [C++] Add support for Libhdfs++
[ https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974463#comment-16974463 ] Deepak Majeti commented on ARROW-2496: -- Not anytime soon. We are still working on improving the robustness of Libhdfs++. > [C++] Add support for Libhdfs++ > --- > > Key: ARROW-2496 > URL: https://issues.apache.org/jira/browse/ARROW-2496 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Deepak Majeti >Assignee: Deepak Majeti >Priority: Major > Labels: HDFS > > Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS > project. Details are available here. > https://issues.apache.org/jira/browse/HDFS-8707 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6749) [Python] Conversion of non-ns timestamp array to numpy gives wrong values
[ https://issues.apache.org/jira/browse/ARROW-6749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-6749. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5718 [https://github.com/apache/arrow/pull/5718] > [Python] Conversion of non-ns timestamp array to numpy gives wrong values > - > > Key: ARROW-6749 > URL: https://issues.apache.org/jira/browse/ARROW-6749 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h > Remaining Estimate: 0h > > {code} > In [25]: np_arr = np.arange("2012-01-01", "2012-01-06", int(1e6)*60*60*24, > dtype="datetime64[us]") > > In [26]: np_arr > > > Out[26]: > array(['2012-01-01T00:00:00.00', '2012-01-02T00:00:00.00', >'2012-01-03T00:00:00.00', '2012-01-04T00:00:00.00', >'2012-01-05T00:00:00.00'], dtype='datetime64[us]') > In [27]: arr = pa.array(np_arr) > > > In [28]: arr > > > Out[28]: > > [ > 2012-01-01 00:00:00.00, > 2012-01-02 00:00:00.00, > 2012-01-03 00:00:00.00, > 2012-01-04 00:00:00.00, > 2012-01-05 00:00:00.00 > ] > In [29]: arr.type > > > Out[29]: TimestampType(timestamp[us]) > In [30]: arr.to_numpy() > > > Out[30]: > array(['1970-01-16T08:09:36.0', '1970-01-16T08:11:02.4', >'1970-01-16T08:12:28.8', '1970-01-16T08:13:55.2', >'1970-01-16T08:15:21.6'], dtype='datetime64[ns]') > {code} > So it seems to simply interpret the integer microsecond values as nanoseconds > when converting to numpy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
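The bug class reported above, an integer microsecond count relabeled as nanoseconds without rescaling, can be reproduced with the standard library alone (illustrative constants, not pyarrow code): the same integer decodes to 2012 or to mid-January 1970 depending on which unit it is read in.

```python
from datetime import datetime, timezone

# A timestamp stored as an integer is meaningless without its unit.
# 2012-01-01T00:00:00 UTC expressed as microseconds since the epoch:
us_since_epoch = 1_325_376_000_000_000

# Correct decoding: divide by 1e6 to get seconds.
correct = datetime.fromtimestamp(us_since_epoch / 1_000_000, tz=timezone.utc)
assert correct.isoformat() == "2012-01-01T00:00:00+00:00"

# The buggy path: the same integer treated as nanoseconds, i.e. every
# value shrinks by 1000x and collapses to shortly after the epoch --
# matching the 1970-01-16 values in the report above.
wrong = datetime.fromtimestamp(us_since_epoch / 1_000_000_000, tz=timezone.utc)
assert wrong.isoformat() == "1970-01-16T08:09:36+00:00"
```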
[jira] [Updated] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type
[ https://issues.apache.org/jira/browse/ARROW-6176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6176: -- Labels: pull-request-available (was: ) > [Python] Allow to subclass ExtensionArray to attach to custom extension type > > > Key: ARROW-6176 > URL: https://issues.apache.org/jira/browse/ARROW-6176 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Currently, you can define a custom extension type in Python with > {code} > class UuidType(pa.ExtensionType): > def __init__(self): > pa.ExtensionType.__init__(self, pa.binary(16)) > def __reduce__(self): > return UuidType, () > {code} > but the array you can create with this is always ExtensionArray. We should > provide a way to define a subclass (eg `UuidArray` in this case) that can > hold custom logic. > For example, a user might want to define `UuidArray` such that `arr[i]` > returns an instance of Python's `uuid.UUID` > From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691 -- This message was sent by Atlassian Jira (v8.3.4#803005)
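The kind of subclass the issue asks for can be sketched in plain Python (hypothetical classes, not the actual pyarrow ExtensionArray API): a base array holds raw 16-byte storage values, and a subclass overrides element access to return a rich uuid.UUID instead of the bytes.

```python
import uuid

# Hypothetical sketch of the requested behavior: custom logic lives in an
# array subclass, so arr[i] yields a domain object rather than raw storage.
class RawBinaryArray:
    def __init__(self, values):
        self.values = list(values)   # each element: 16 raw bytes

    def __getitem__(self, i):
        return self.values[i]

class UuidArray(RawBinaryArray):
    def __getitem__(self, i):
        # Wrap the underlying fixed-size binary value in a uuid.UUID.
        return uuid.UUID(bytes=super().__getitem__(i))

raw = uuid.uuid4().bytes
arr = UuidArray([raw])
assert isinstance(arr[0], uuid.UUID)
assert arr[0].bytes == raw  # round-trips the storage bytes exactly
```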
[jira] [Commented] (ARROW-7168) pa.array() doesn't respect provided dictionary type with all NaNs
[ https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974427#comment-16974427 ] Thomas Buhrmann commented on ARROW-7168: Since I'm already at it, and in case somebody faces the same problem... To safely convert pandas categoricals to arrow, ensuring a constant type across batches, something like the following would work:
{code:python}
def categorical_to_arrow(ser, known_categories=None, ordered=None):
    """Safely create a pa.array from a categorical pd.Series.

    Args:
        ser (pd.Series): should be of CategoricalDtype
        known_categories (np.array): force known categories. If None, and the
            Series doesn't have any values to infer it from, will use an empty
            array of the same dtype as the categories attribute of the Series
        ordered (bool): whether categories should be ordered
    """
    n = len(ser)
    all_nan = ser.isna().sum() == n

    # Enforce provided categories, use the original ones, or enforce
    # the correct value_type if Arrow would otherwise change it to 'null'
    if isinstance(known_categories, np.ndarray):
        dictionary = known_categories
    elif all_nan:
        # value_type may be known, but Arrow doesn't understand 'object' dtype
        value_type = ser.cat.categories.dtype
        if value_type == 'object':
            value_type = 'str'
        dictionary = np.array([], dtype=value_type)
    else:
        dictionary = ser.cat.categories

    # Allow overwriting of ordered attribute
    if ordered is None:
        ordered = ser.cat.ordered

    if all_nan:
        return pa.DictionaryArray.from_arrays(
            indices=-np.ones(n, dtype=ser.cat.codes.dtype),
            dictionary=dictionary,
            mask=np.ones(n, dtype='bool'),
            ordered=ordered)
    else:
        return pa.DictionaryArray.from_arrays(
            indices=ser.cat.codes,
            dictionary=dictionary,
            ordered=ordered,
            from_pandas=True
        )
{code}
This seems to be the only (?) way to have control over the resulting dictionary type. 
E.g.:
{code:python}
# String categories with and without non-NaN values
sers = [
    pd.Series([None, None]).astype('object').astype('category'),
    pd.Series(['a', None, None]).astype('category')
]

# The categorical types we may want
known_categories = [
    None,
    np.array(['a', 'b', 'c'], dtype='str'),
    np.array([1, 2, 3], dtype='int8')
]

# Convert each series with each of the desired category types
for ser in sers:
    for cats in known_categories:
        arr = categorical_to_arrow(ser, known_categories=cats)
        ser2 = pd.Series(arr.to_pandas())
        print(f"Series: {list(ser)} | Known categories: {cats}")
        print(f"Dictionary type: {arr.type}")
        print(f"Roundtripped Series: \n{ser2}", "\n")
{code}
which produces:
{noformat}
Series: [nan, nan] | Known categories: None
Dictionary type: dictionary
Roundtripped Series:
0    NaN
1    NaN
dtype: category
Categories (0, object): []

Series: [nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary
Roundtripped Series:
0    NaN
1    NaN
dtype: category
Categories (3, object): [a, b, c]

Series: [nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary
Roundtripped Series:
0    NaN
1    NaN
dtype: category
Categories (3, int64): [1, 2, 3]

Series: ['a', nan, nan] | Known categories: None
Dictionary type: dictionary
Roundtripped Series:
0      a
1    NaN
2    NaN
dtype: category
Categories (1, object): [a]

Series: ['a', nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary
Roundtripped Series:
0      a
1    NaN
2    NaN
dtype: category
Categories (3, object): [a, b, c]

Series: ['a', nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary
Roundtripped Series:
0      1
1    NaN
2    NaN
dtype: category
Categories (3, int64): [1, 2, 3]
{noformat}
(the last example would correspond to a recoding of the categories, but that'd be a usage problem...) 
> pa.array() doesn't respect provided dictionary type with all NaNs > - > > Key: ARROW-7168 > URL: https://issues.apache.org/jira/browse/ARROW-7168 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.15.1 >Reporter: Thomas Buhrmann >Priority: Major > > This might be related to ARROW-6548 and others dealing with all NaN columns. > When creating a dictionary array, even when fully specifying the desired > type, this type is not respected when the data contains only NaNs: > {code:python} > # This may look a little artificial but easily occurs when processing > categorial data in batches and a particular batch containing only NaNs > ser = pd.Series([None,
[jira] [Updated] (ARROW-7170) [C++] Bundled ORC fails linking
[ https://issues.apache.org/jira/browse/ARROW-7170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7170: -- Labels: pull-request-available (was: ) > [C++] Bundled ORC fails linking > --- > > Key: ARROW-7170 > URL: https://issues.apache.org/jira/browse/ARROW-7170 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > This shows up when building the tests as well: > {code} > [1/2] Linking CXX executable debug/orc-adapter-test > FAILED: debug/orc-adapter-test > : && /usr/bin/ccache /usr/bin/clang++-7 -Qunused-arguments > -fcolor-diagnostics -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation > -Wno-unused-parameter -Wno-unknown-warning-option -Werror > -Wno-unknown-warning-option -msse4.2 -maltivec -D_GLIBCXX_USE_CXX11_ABI=1 > -D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g -rdynamic > src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o -o > debug/orc-adapter-test > -Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib > /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so > /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl > debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 > orc_ep-install/lib/liborc.a > /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl > double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a > /home/antoine/miniconda3/envs/pyarrow/lib/libssl.so > /home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so > /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a > /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a > /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a > /home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so > 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 > /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm > -lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so > jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a > mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread > -lrt -Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && : > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284: > error: undefined reference to 'deflateInit2_' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232: > error: undefined reference to 'deflateReset' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254: > error: undefined reference to 'deflate' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291: > error: undefined reference to 'deflateEnd' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405: > error: undefined reference to 'inflateInit2_' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430: > error: undefined reference to 'inflateEnd' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471: > error: undefined reference to 'inflateReset' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477: > error: undefined reference to 'inflate' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820: > error: undefined reference to 'snappy::GetUncompressedLength(char const*, > unsigned long, unsigned long*)' > 
/home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828: > error: undefined reference to 'snappy::RawUncompress(char const*, unsigned > long, char*)' > /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894: > error: undefined reference to 'LZ4_decompress_safe' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7170) [C++] Bundled ORC fails linking
Antoine Pitrou created ARROW-7170: - Summary: [C++] Bundled ORC fails linking Key: ARROW-7170 URL: https://issues.apache.org/jira/browse/ARROW-7170 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Antoine Pitrou This shows up when building the tests as well: {code} [1/2] Linking CXX executable debug/orc-adapter-test FAILED: debug/orc-adapter-test : && /usr/bin/ccache /usr/bin/clang++-7 -Qunused-arguments -fcolor-diagnostics -fuse-ld=gold -ggdb -O0 -Wall -Wextra -Wdocumentation -Wno-unused-parameter -Wno-unknown-warning-option -Werror -Wno-unknown-warning-option -msse4.2 -maltivec -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 -fno-omit-frame-pointer -g -rdynamic src/arrow/adapters/orc/CMakeFiles/orc-adapter-test.dir/adapter_test.cc.o -o debug/orc-adapter-test -Wl,-rpath,/home/antoine/arrow/dev/cpp/build-test/debug:/home/antoine/miniconda3/envs/pyarrow/lib /home/antoine/miniconda3/envs/pyarrow/lib/libgtest_main.so /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -lpthread -ldl debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 orc_ep-install/lib/liborc.a /home/antoine/miniconda3/envs/pyarrow/lib/libgtest.so -ldl double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a /home/antoine/miniconda3/envs/pyarrow/lib/libssl.so /home/antoine/miniconda3/envs/pyarrow/lib/libcrypto.so /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlienc-static.a /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlidec-static.a /home/antoine/miniconda3/envs/pyarrow/lib/libbrotlicommon-static.a /home/antoine/miniconda3/envs/pyarrow/lib/libprotobuf.so /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-config.so /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-transfer.so /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-s3.so /home/antoine/miniconda3/envs/pyarrow/lib/libaws-cpp-sdk-core.so /home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-event-stream.so.1.0.0 
/home/antoine/miniconda3/envs/pyarrow/lib/libaws-c-common.so.1.0.0 -lm -lpthread /home/antoine/miniconda3/envs/pyarrow/lib/libaws-checksums.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a mimalloc_ep/src/mimalloc_ep/lib/mimalloc-1.0/libmimalloc-debug.a -pthread -lrt -Wl,-rpath-link,/home/antoine/miniconda3/envs/pyarrow/lib && : /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:284: error: undefined reference to 'deflateInit2_' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:232: error: undefined reference to 'deflateReset' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:254: error: undefined reference to 'deflate' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:291: error: undefined reference to 'deflateEnd' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:405: error: undefined reference to 'inflateInit2_' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:430: error: undefined reference to 'inflateEnd' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:471: error: undefined reference to 'inflateReset' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:477: error: undefined reference to 'inflate' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:820: error: undefined reference to 'snappy::GetUncompressedLength(char const*, unsigned long, unsigned long*)' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:828: error: undefined reference to 'snappy::RawUncompress(char const*, unsigned long, char*)' /home/antoine/arrow/dev/cpp/build-test/orc_ep-prefix/src/orc_ep/c++/src/Compression.cc:894: error: undefined reference to 'LZ4_decompress_safe' {code} -- This message was sent by Atlassian 
Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7169) [C++] Vendor uriparser library
Antoine Pitrou created ARROW-7169: - Summary: [C++] Vendor uriparser library Key: ARROW-7169 URL: https://issues.apache.org/jira/browse/ARROW-7169 Project: Apache Arrow Issue Type: Wish Components: C++ Reporter: Antoine Pitrou The [uriparser C library|https://github.com/uriparser/uriparser] is used internally for URI parsing. Instead of having an explicit dependency, we could simply vendor it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7169) [C++] Vendor uriparser library
[ https://issues.apache.org/jira/browse/ARROW-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974301#comment-16974301 ] Antoine Pitrou commented on ARROW-7169: --- [~npr] This might ease the work for R a bit. > [C++] Vendor uriparser library > -- > > Key: ARROW-7169 > URL: https://issues.apache.org/jira/browse/ARROW-7169 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > > The [uriparser C library|https://github.com/uriparser/uriparser] is used > internally for URI parsing. Instead of having an explicit dependency, we > could simply vendor it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6633) [C++] Do not require double-conversion for default build
[ https://issues.apache.org/jira/browse/ARROW-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6633: -- Labels: pull-request-available (was: ) > [C++] Do not require double-conversion for default build > > > Key: ARROW-6633 > URL: https://issues.apache.org/jira/browse/ARROW-6633 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > This library is only needed in core builds if > * ARROW_JSON=on or > * ARROW_CSV=on (option to be added) or > * ARROW_BUILD_TESTS=on > The double conversion headers leak into > * arrow/util/decimal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
[ https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-7162: - Assignee: Antoine Pitrou > [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake > --- > > Key: ARROW-7162 > URL: https://issues.apache.org/jira/browse/ARROW-7162 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > For clang we currently disable a lot of warnings explicitly. This dates back > to when we enabled {{-Weverything}}. We should probably remove most or all of > these flags now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
[ https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-7162. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5828 [https://github.com/apache/arrow/pull/5828] > [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake > --- > > Key: ARROW-7162 > URL: https://issues.apache.org/jira/browse/ARROW-7162 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > For clang we currently disable a lot of warnings explicitly. This dates back > to when we enabled {{-Weverything}}. We should probably remove most or all of > these flags now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6633) [C++] Do not require double-conversion for default build
[ https://issues.apache.org/jira/browse/ARROW-6633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-6633: - Assignee: Antoine Pitrou > [C++] Do not require double-conversion for default build > > > Key: ARROW-6633 > URL: https://issues.apache.org/jira/browse/ARROW-6633 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > This library is only needed in core builds if > * ARROW_JSON=on or > * ARROW_CSV=on (option to be added) or > * ARROW_BUILD_TESTS=on > The double conversion headers leak into > * arrow/util/decimal.h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6818) [Doc] Format docs confusing
[ https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974282#comment-16974282 ] Wes McKinney commented on ARROW-6818: - I recently spent a lot of time on these documents. It's time for someone else to take a turn. > [Doc] Format docs confusing > --- > > Key: ARROW-6818 > URL: https://issues.apache.org/jira/browse/ARROW-6818 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Format >Reporter: Antoine Pitrou >Priority: Major > > I find there are several issues in the format docs. > 1) there is a claimed distinction between "logical types" and "physical > types", but the "physical types" actually lists logical types such as Map > 2) the "logical types" document doesn't actually list logical types, it just > sends to the flatbuffers file. One shouldn't have to read a flatbuffers file > to understand the Arrow format. > 3) some terminology seems unusual, such as "relative type" > 4) why is there a link to the Apache Drill docs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
[ https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974270#comment-16974270 ] SURESH CHAGANTI commented on ARROW-4890: Thank you [~emkornfi...@gmail.com], I appreciate your time. I have run multiple tests with different data sizes, and shards larger than 2GB failed. Glad to see the fix is in progress; it would be great if this change rolled out sooner. I will be happy to contribute to this issue. > [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1 > - > > Key: ARROW-4890 > URL: https://issues.apache.org/jira/browse/ARROW-4890 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Cloudera cdh5.13.3 > Cloudera Spark 2.3.0.cloudera3 >Reporter: Abdeali Kothari >Priority: Major > Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png > > > Creating this in Arrow project as the traceback seems to suggest this is an > issue in Arrow. > Continuation from the conversation on the > https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E > When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error: > {noformat} > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", > line 279, in load_stream > for batch in reader: > File "pyarrow/ipc.pxi", line 265, in __iter__ > File "pyarrow/ipc.pxi", line 281, in > pyarrow.lib._RecordBatchReader.read_next_batch > File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: read length must be positive or -1 > {noformat} > as the size of the dataset I want to group on increases. Here is a > reproducible code snippet where I can reproduce this. 
> Note: My actual dataset is much larger and has many more unique IDs and is a > valid usecase where I cannot simplify this groupby in any way. I have > stripped out all the logic to make this example as simple as I could. > {code:java} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell' > import findspark > findspark.init() > import pyspark > from pyspark.sql import functions as F, types as T > import pandas as pd > spark = pyspark.sql.SparkSession.builder.getOrCreate() > pdf1 = pd.DataFrame( > [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]], > columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4'] > ) > df1 = spark.createDataFrame(pd.concat([pdf1 for i in > range(429)]).reset_index()).drop('index') > pdf2 = pd.DataFrame( > [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", > "abcdefghijklmno"]], > columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6'] > ) > df2 = spark.createDataFrame(pd.concat([pdf2 for i in > range(48993)]).reset_index()).drop('index') > df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner') > def myudf(df): > return df > df4 = df3 > udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf) > df5 = df4.groupBy('df1_c1').apply(udf) > print('df5.count()', df5.count()) > # df5.write.parquet('/tmp/temp.parquet', mode='overwrite') > {code} > I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per > executor too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6636) [C++] Do not build C++ command line utilities by default
[ https://issues.apache.org/jira/browse/ARROW-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6636: -- Labels: pull-request-available (was: ) > [C++] Do not build C++ command line utilities by default > > > Key: ARROW-6636 > URL: https://issues.apache.org/jira/browse/ARROW-6636 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > This means to change {{ARROW_BUILD_UTILITIES}} to be off by default. These > are mostly used for integration testing, so building unit or integration > tests should toggle this on automatically. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6635) [C++] Do not require glog for default build
[ https://issues.apache.org/jira/browse/ARROW-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6635: -- Labels: pull-request-available (was: ) > [C++] Do not require glog for default build > --- > > Key: ARROW-6635 > URL: https://issues.apache.org/jira/browse/ARROW-6635 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We should change the default for {{ARROW_USE_GLOG}} to be off -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code
[ https://issues.apache.org/jira/browse/ARROW-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974243#comment-16974243 ] Antoine Pitrou commented on ARROW-2196: --- Is there something left to do here? > [C++] Consider quarantining platform code with dependency on non-header Boost > code > -- > > Key: ARROW-2196 > URL: https://issues.apache.org/jira/browse/ARROW-2196 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > see discussion in ARROW-2193 for the motivation -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2078) [C++] Consider making DataType and its subclasses noncopyable
[ https://issues.apache.org/jira/browse/ARROW-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974242#comment-16974242 ] Antoine Pitrou commented on ARROW-2078: --- Actually, we already have: {code:c++} ARROW_DISALLOW_COPY_AND_ASSIGN(DataType); {code} Does this issue allude to something else? If so, what exactly? > [C++] Consider making DataType and its subclasses noncopyable > - > > Key: ARROW-2078 > URL: https://issues.apache.org/jira/browse/ARROW-2078 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6257) [C++] Add fnmatch compatible globbing function
[ https://issues.apache.org/jira/browse/ARROW-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-6257. --- Assignee: (was: Ben Kietzman) Resolution: Won't Fix > [C++] Add fnmatch compatible globbing function > -- > > Key: ARROW-6257 > URL: https://issues.apache.org/jira/browse/ARROW-6257 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > This will be useful for the filesystems module and in datasource discovery, > which uses it. > Behavior should be compatible with > http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6320) [C++] Arrow utilities are linked statically
[ https://issues.apache.org/jira/browse/ARROW-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-6320. - Resolution: Done This seems to have been fixed at some point. > [C++] Arrow utilities are linked statically > --- > > Key: ARROW-6320 > URL: https://issues.apache.org/jira/browse/ARROW-6320 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > Even though other executables are linked dynamically with {{libarrow}} and > friends, the arrow utilities are linked statically on Linux: > {code} > $ ldd build-test/debug/arrow-stream-to-file > linux-vdso.so.1 (0x7ffe353a8000) > libboost_filesystem.so.1.67.0 => > /home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 > (0x7f7baf7a1000) > libboost_system.so.1.67.0 => > /home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.67.0 > (0x7f7baf59c000) > libstdc++.so.6 => > /home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f7bb0522000) > libgcc_s.so.1 => > /home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f7bb050e000) > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 > (0x7f7baf37d000) > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f7baef8c000) > /lib64/ld-linux-x86-64.so.2 (0x7f7bb0471000) > librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f7baed84000) > libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f7bae9e6000) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4799) [C++] Propose alternative strategy for handling Operation logical output types
[ https://issues.apache.org/jira/browse/ARROW-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974235#comment-16974235 ] Antoine Pitrou commented on ARROW-4799: --- Well, neither does the Expr / Operation hierarchy in compute, AFAICT. > [C++] Propose alternative strategy for handling Operation logical output types > -- > > Key: ARROW-4799 > URL: https://issues.apache.org/jira/browse/ARROW-4799 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > Currently in the prototype work in ARROW-4782, operations are being "boxed" > in strongly typed Expr types. An alternative structure would be for an > operation to define a virtual > {code} > virtual std::shared_ptr<ArgType> out_type() const = 0; > {code} > Where {{ArgType}} is some class that encodes the arity (array vs. scalar > vs ...) and value type (if any) that is emitted by the operation. > Operations emitting multiple pieces of data would need some kind of "tuple" > object output. We can iterate on this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6257) [C++] Add fnmatch compatible globbing function
[ https://issues.apache.org/jira/browse/ARROW-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974232#comment-16974232 ] Antoine Pitrou commented on ARROW-6257: --- Is this still relevant? I don't think being POSIX-fnmatch-compatible is important here. [~fsaintjacques] > [C++] Add fnmatch compatible globbing function > -- > > Key: ARROW-6257 > URL: https://issues.apache.org/jira/browse/ARROW-6257 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Fix For: 1.0.0 > > > This will be useful for the filesystems module and in datasource discovery, > which uses it. > Behavior should be compatible with > http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4799) [C++] Propose alternative strategy for handling Operation logical output types
[ https://issues.apache.org/jira/browse/ARROW-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974231#comment-16974231 ] Francois Saint-Jacques commented on ARROW-4799: --- This is still relevant. The expression class in dataset does not represent logical (relational) operations. > [C++] Propose alternative strategy for handling Operation logical output types > -- > > Key: ARROW-4799 > URL: https://issues.apache.org/jira/browse/ARROW-4799 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > Currently in the prototype work in ARROW-4782, operations are being "boxed" > in strongly typed Expr types. An alternative structure would be for an > operation to define a virtual > {code} > virtual std::shared_ptr<ArgType> out_type() const = 0; > {code} > Where {{ArgType}} is some class that encodes the arity (array vs. scalar > vs ...) and value type (if any) that is emitted by the operation. > Operations emitting multiple pieces of data would need some kind of "tuple" > object output. We can iterate on this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6612) [C++] Add ARROW_CSV CMake build flag
[ https://issues.apache.org/jira/browse/ARROW-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974230#comment-16974230 ] Antoine Pitrou commented on ARROW-6612: --- I'd rather close this as won't fix. Our build system is already too convoluted. > [C++] Add ARROW_CSV CMake build flag > > > Key: ARROW-6612 > URL: https://issues.apache.org/jira/browse/ARROW-6612 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > I think it would be better to make building this part of the project not > unconditional -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage
[ https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974229#comment-16974229 ] Antoine Pitrou commented on ARROW-1292: --- Is this still relevant? Does it make sense to focus efforts on HDFS? > [C++/Python] Expand libhdfs feature coverage > > > Key: ARROW-1292 > URL: https://issues.apache.org/jira/browse/ARROW-1292 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: filesystem > Fix For: 1.0.0 > > > Umbrella JIRA. Will create child issues for more granular tasks -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6615) [C++] Add filtering option to fs::Selector
[ https://issues.apache.org/jira/browse/ARROW-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques resolved ARROW-6615. --- Resolution: Won't Fix > [C++] Add filtering option to fs::Selector > -- > > Key: ARROW-6615 > URL: https://issues.apache.org/jira/browse/ARROW-6615 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > It would be convenient if Selector could support file path filtering, either via > a regex or globbing applied to the path. > This is semi-required for filtering files in Dataset to properly apply the > file format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-1403) [C++] Add variant of SerializeRecordBatch that accepts a writer-allocator callback
[ https://issues.apache.org/jira/browse/ARROW-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-1403. - Resolution: Abandoned Closing as outdated. > [C++] Add variant of SerializeRecordBatch that accepts a writer-allocator > callback > --- > > Key: ARROW-1403 > URL: https://issues.apache.org/jira/browse/ARROW-1403 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > When writing to other kinds of interfaces, like GPU, it would be useful to be > able to pass a function or closure that can instantiate an instance of > {{OutputStream}} (which might write to CPU memory, GPU, etc.) given the > computed size of the record batch. Currently we allocate new CPU memory and > write to that buffer, but this would eliminate an intermediate copy. > So something like > {code} > typedef std::function<Status(int64_t, std::shared_ptr<OutputStream>*)> > StreamCreator; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-1445) [Python] Segfault when using libhdfs3 in pyarrow using latest API
[ https://issues.apache.org/jira/browse/ARROW-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-1445. - Resolution: Abandoned Closing as outdated. Feel free to open a new issue if you still experience this. > [Python] Segfault when using libhdfs3 in pyarrow using latest API > - > > Key: ARROW-1445 > URL: https://issues.apache.org/jira/browse/ARROW-1445 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.6.0 >Reporter: James Porritt >Priority: Major > > I'm encountering a segfault when using libhdfs3 with pyarrow. > My script is: > {code} > import pyarrow > def main(): > hdfs = pyarrow.hdfs.connect("", , "", > driver='libhdfs') > print hdfs.ls('') > hdfs3a = pyarrow.HdfsClient("", , "", > driver='libhdfs3') > print hdfs3a.ls('') > hdfs3b = pyarrow.hdfs.connect("", , "", > driver='libhdfs3') > print hdfs3b.ls('') > main() > {code} > The first two hdfs connections yield the correct list. The third yields: > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f69c0c8b57f, pid=88070, tid=140092200666880 > # > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build > 1.8.0_60-b27) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode > linux-amd64 compressed oops) > # Problematic frame: > # C [libc.so.6+0x13357f] __strlen_sse42+0xf > {noformat} > It dumps an error report file too. 
> I created my conda environment with: > {noformat} > conda create -n parquet > source activate parquet > conda install pyarrow libhdfs3 -c conda-forge > {noformat} > The packages used are: > {noformat} > arrow-cpp 0.6.0 np113py27_1conda-forge > boost-cpp 1.64.01conda-forge > bzip2 1.0.6 1conda-forge > ca-certificates 2017.7.27.1 0conda-forge > certifi 2017.7.27.1 py27_0conda-forge > curl 7.54.10conda-forge > icu 58.1 1conda-forge > krb5 1.14.20conda-forge > libgcrypt 1.8.0 0conda-forge > libgpg-error 1.27 0conda-forge > libgsasl 1.8.0 1conda-forge > libhdfs3 2.3 0conda-forge > libiconv 1.14 4conda-forge > libntlm 1.4 0conda-forge > libssh2 1.8.0 1conda-forge > libuuid 1.0.3 1conda-forge > libxml2 2.9.4 4conda-forge > mkl 2017.0.3 0 > ncurses 5.9 10conda-forge > numpy 1.13.1 py27_0 > openssl 1.0.2l0conda-forge > pandas0.20.3 py27_1conda-forge > parquet-cpp 1.3.0.pre 1conda-forge > pip 9.0.1py27_0conda-forge > protobuf 3.3.2py27_0conda-forge > pyarrow 0.6.0 np113py27_1conda-forge > python2.7.131conda-forge > python-dateutil 2.6.1py27_0conda-forge > pytz 2017.2 py27_0conda-forge > readline 6.2 0conda-forge > setuptools36.2.2 py27_0conda-forge > six 1.10.0 py27_1conda-forge > sqlite3.13.01conda-forge > tk8.5.192conda-forge > wheel 0.29.0 py27_0conda-forge > xz5.2.3 0conda-forge > zlib 1.2.110conda-forge > {noformat} > I've set my ARROW_LIBHDFS_DIR to point at the location of the libhdfs3.so > file. > I've populated my CLASSPATH as per the documentation. > Please advise. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6818) [Doc] Format docs confusing
[ https://issues.apache.org/jira/browse/ARROW-6818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974217#comment-16974217 ] Antoine Pitrou commented on ARROW-6818: --- cc [~wesm] [~emkornfield] > [Doc] Format docs confusing > --- > > Key: ARROW-6818 > URL: https://issues.apache.org/jira/browse/ARROW-6818 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation, Format >Reporter: Antoine Pitrou >Priority: Major > > I find there are several issues in the format docs. > 1) there is a claimed distinction between "logical types" and "physical > types", but the "physical types" actually lists logical types such as Map > 2) the "logical types" document doesn't actually list logical types, it just > sends to the flatbuffers file. One shouldn't have to read a flatbuffers file > to understand the Arrow format. > 3) some terminology seems unusual, such as "relative type" > 4) why is there a link to the Apache Drill docs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6511) [Developer] Remove run_docker_compose.sh
[ https://issues.apache.org/jira/browse/ARROW-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-6511. - Resolution: Done It seems the script was removed in the meantime. > [Developer] Remove run_docker_compose.sh > > > Key: ARROW-6511 > URL: https://issues.apache.org/jira/browse/ARROW-6511 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Ben Kietzman >Priority: Trivial > > dev/run_docker_compose.sh and Makefile.docker perform fundamentally the same > function: run docker-compose conveniently. Consolidating them is probably > worthwhile, and since Makefile.docker also builds dependencies it seems the > natural choice. > update dev/README.md as well -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-5431) [C++] Protobuf fails building on Windows
[ https://issues.apache.org/jira/browse/ARROW-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-5431. - Resolution: Abandoned > [C++] Protobuf fails building on Windows > > > Key: ARROW-5431 > URL: https://issues.apache.org/jira/browse/ARROW-5431 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > > Looks like this part of our build chain assumes Unix: > {code} > [55/489] Performing configure step for 'protobuf_ep' > FAILED: protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-configure > cmd.exe /C "cd /D > C:\t\arrow\cpp\build-debug\protobuf_ep-prefix\src\protobuf_ep > && C:\Miniconda3\envs\arrow\Library\bin\cmake.exe -P > C:/t/arrow/cpp/build-debug/ > protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-configure-DEBUG.cmake && > C: > \Miniconda3\envs\arrow\Library\bin\cmake.exe -E touch > C:/t/arrow/cpp/build-debug > /protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-configure" > CMake Error at > C:/t/arrow/cpp/build-debug/protobuf_ep-prefix/src/protobuf_ep-sta > mp/protobuf_ep-configure-DEBUG.cmake:49 (message): > Command failed: %1 is not a valid Win32 application >'./configure' 'AR=' 'RANLIB=' 'CC=C:/Program Files (x86)/Microsoft Visual > Stu > dio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx64/x64/cl.exe' > 'CXX=C:/Min > iconda3/envs/arrow/Scripts/clcache.exe' '--disable-shared' > '--prefix=C:/t/arrow/ > cpp/thirdparty/protobuf_ep-install' 'CFLAGS=/DWIN32 /D_WINDOWS /W3 /MDd /Zi > /Ob > 0 /Od /RTC1' 'CXXFLAGS=/DWIN32 /D_WINDOWS /GR /EHsc > /D_SILENCE_TR1_NAMESPACE_DE > PRECATION_WARNING /MDd /Zi /Ob0 /Od /RTC1' > See also > > C:/t/arrow/cpp/build-debug/protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf > _ep-configure-*.log > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4799) [C++] Propose alternative strategy for handling Operation logical output types
[ https://issues.apache.org/jira/browse/ARROW-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974205#comment-16974205 ] Antoine Pitrou commented on ARROW-4799: --- Is this still relevant? It seems like the expression layer from datasets should be used instead. cc [~fsaintjacques] > [C++] Propose alternative strategy for handling Operation logical output types > -- > > Key: ARROW-4799 > URL: https://issues.apache.org/jira/browse/ARROW-4799 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > Currently in the prototype work in ARROW-4782, operations are being "boxed" > in strongly typed Expr types. An alternative structure would be for an > operation to define a virtual > {code} > virtual std::shared_ptr<ArgType> out_type() const = 0; > {code} > Where {{ArgType}} is some class that encodes the arity (array vs. scalar > vs) and value type (if any) that is emitted by the operation. > Operations emitting multiple pieces of data would need some kind of "tuple" > object output. We can iterate on this -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7168) pa.array() doesn't respect provided dictionary type with all NaNs
Thomas Buhrmann created ARROW-7168: -- Summary: pa.array() doesn't respect provided dictionary type with all NaNs Key: ARROW-7168 URL: https://issues.apache.org/jira/browse/ARROW-7168 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.15.1 Reporter: Thomas Buhrmann This might be related to ARROW-6548 and others dealing with all NaN columns. When creating a dictionary array, even when fully specifying the desired type, this type is not respected when the data contains only NaNs: {code:python} # This may look a little artificial but easily occurs when processing categorical data in batches and a particular batch contains only NaNs ser = pd.Series([None, None]).astype('object').astype('category') typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False) pa.array(ser, type=typ).type {code} results in {noformat} >> DictionaryType(dictionary) {noformat} which means that one cannot e.g. serialize batches of categoricals if the possibility of all-NaN batches exists, even when trying to enforce that each batch has the same schema (because the schema is not respected). I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case? In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6003) [C++] Better input validation and error messaging in CSV reader
[ https://issues.apache.org/jira/browse/ARROW-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-6003: -- Component/s: R > [C++] Better input validation and error messaging in CSV reader > --- > > Key: ARROW-6003 > URL: https://issues.apache.org/jira/browse/ARROW-6003 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Neal Richardson >Priority: Major > Labels: csv > > Followup to https://issues.apache.org/jira/browse/ARROW-5747. The error > message(s) are not great when you give bad input. For example, if I give too > many or too few {{column_names}}, the error I get is {{Invalid: Empty CSV > file}}. In fact, that's about the only error message I've seen from the CSV > reader, no matter what I've thrown at it. > It would be better if error messages were more specific so that I as a user > might know how to fix my bad input. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1099) [C++] Add support for PFOR integer compression
[ https://issues.apache.org/jira/browse/ARROW-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974184#comment-16974184 ] Antoine Pitrou commented on ARROW-1099: --- Do we want to keep this open? > [C++] Add support for PFOR integer compression > -- > > Key: ARROW-1099 > URL: https://issues.apache.org/jira/browse/ARROW-1099 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > https://github.com/lemire/FastPFor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6615) [C++] Add filtering option to fs::Selector
[ https://issues.apache.org/jira/browse/ARROW-6615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974180#comment-16974180 ] Antoine Pitrou commented on ARROW-6615: --- Can we close this? > [C++] Add filtering option to fs::Selector > -- > > Key: ARROW-6615 > URL: https://issues.apache.org/jira/browse/ARROW-6615 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Francois Saint-Jacques >Priority: Major > > It would be convenient if Selector could support file path filtering, either via > a regex or globbing applied to the path. > This is semi-required for filtering files in Dataset to properly apply the > file format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-2496) [C++] Add support for Libhdfs++
[ https://issues.apache.org/jira/browse/ARROW-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974181#comment-16974181 ] Antoine Pitrou commented on ARROW-2496: --- Are you still willing to work on this? > [C++] Add support for Libhdfs++ > --- > > Key: ARROW-2496 > URL: https://issues.apache.org/jira/browse/ARROW-2496 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Deepak Majeti >Assignee: Deepak Majeti >Priority: Major > Labels: HDFS > > Libhdfs++ is an asynchronous pure C++ HDFS client. It is now part of the HDFS > project. Details are available here. > https://issues.apache.org/jira/browse/HDFS-8707 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4763) [C++/Python] Cannot build Gandiva in conda on OSX due to package conflicts
[ https://issues.apache.org/jira/browse/ARROW-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974178#comment-16974178 ] Antoine Pitrou commented on ARROW-4763: --- Is this still relevant? > [C++/Python] Cannot build Gandiva in conda on OSX due to package conflicts > -- > > Key: ARROW-4763 > URL: https://issues.apache.org/jira/browse/ARROW-4763 > Project: Apache Arrow > Issue Type: Bug > Components: C++, C++ - Gandiva, Python >Reporter: Uwe Korn >Priority: Major > > It is currently not reliably possible to build Gandiva with one of the conda > toolchains on OSX as the packages {{llvm==4.0.1}} (pulled in by the > compilers) and {{llvmdev==7.0.1}} conflict in some files: > https://github.com/conda-forge/llvmdev-feedstock/issues/60 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6972) [C#] Should support StructField arrays
[ https://issues.apache.org/jira/browse/ARROW-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-6972: -- Summary: [C#] Should support StructField arrays (was: C# should support StructField arrays) > [C#] Should support StructField arrays > -- > > Key: ARROW-6972 > URL: https://issues.apache.org/jira/browse/ARROW-6972 > Project: Apache Arrow > Issue Type: New Feature > Components: C# >Reporter: Cameron Murray >Priority: Major > > The C# implementation of Arrow does not support struct arrays and, more > generally, complex types. > I notice ARROW-6870 addresses Dictionary arrays; however, this is not as > flexible as structs (for example, cannot mix data types). > The source does have a stub for StructArray; however, there is no Builder nor > example of how to use it, so I assume it is not supported. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-1921) [Doc] Build API docs on a per-release basis
[ https://issues.apache.org/jira/browse/ARROW-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-1921: -- Summary: [Doc] Build API docs on a per-release basis (was: Build API docs on a per-release basis) > [Doc] Build API docs on a per-release basis > --- > > Key: ARROW-1921 > URL: https://issues.apache.org/jira/browse/ARROW-1921 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Uwe Korn >Priority: Major > > Currently we build the docs from time to time manually from master. We should > also build them per release so that you can have a look at the latest > released API version. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-851) C++/Python: Check Boost/Arrow C++ABI for consistency
[ https://issues.apache.org/jira/browse/ARROW-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974176#comment-16974176 ] Antoine Pitrou commented on ARROW-851: -- Is this issue still relevant? > C++/Python: Check Boost/Arrow C++ABI for consistency > > > Key: ARROW-851 > URL: https://issues.apache.org/jira/browse/ARROW-851 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Uwe Korn >Priority: Major > > When building with dependencies from conda-forge on a newer system with GCC, > the C++ ABI versions can differ. We need to ensure that the versions match > between Boost, arrow-cpp and pyarrow in our CMake scripts. > Depending on this, we may need to pass {{-D_GLIBCXX_USE_CXX11_ABI=0}} to > {{CMAKE_CXX_FLAGS}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-4429) [Doc] Add git rebase tips to the 'Contributing' page in the developer docs
[ https://issues.apache.org/jira/browse/ARROW-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4429: -- Summary: [Doc] Add git rebase tips to the 'Contributing' page in the developer docs (was: Add git rebase tips to the 'Contributing' page in the developer docs) > [Doc] Add git rebase tips to the 'Contributing' page in the developer docs > -- > > Key: ARROW-4429 > URL: https://issues.apache.org/jira/browse/ARROW-4429 > Project: Apache Arrow > Issue Type: Task > Components: Documentation >Reporter: Tanya Schlusser >Priority: Major > > A recent discussion on the listserv (link below) asked about how contributors > should handle rebasing. It would be helpful if the tips made it into the > developer documentation somehow. I suggest in the ["Contributing to Apache > Arrow"|https://cwiki.apache.org/confluence/display/ARROW/Contributing+to+Apache+Arrow] > page—currently a wiki, but hopefully eventually part of the Sphinx docs > ARROW-4427. > Here is the relevant thread: > [https://lists.apache.org/thread.html/c74d8027184550b8d9041e3f2414b517ffb76ccbc1d5aa4563d364b6@%3Cdev.arrow.apache.org%3E] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5067) [Doc] Adding support for versioned documentation
[ https://issues.apache.org/jira/browse/ARROW-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-5067: -- Summary: [Doc] Adding support for versioned documentation (was: Adding support for versioned documentation) > [Doc] Adding support for versioned documentation > > > Key: ARROW-5067 > URL: https://issues.apache.org/jira/browse/ARROW-5067 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Affects Versions: 0.9.0, 0.12.1 >Reporter: Helen Ngo >Priority: Minor > Labels: beginner, documentation, newbie, starter > Original Estimate: 504h > Remaining Estimate: 504h > > Per this GitHub issue: [https://github.com/apache/arrow/issues/4016] > *The request:* API docs for previous PyArrow versions as a drop-down link in > the sidebar of this page: [https://arrow.apache.org/docs/python/api.html] > I often work in multiple environments with PyArrow versions ranging from > 0.8.0 to 0.12+, and the core Python API has changed quite a bit > version-to-version so code occasionally no longer works. It can be difficult > to hunt down the corresponding changes in each version. Versioned docs > similar to pandas and other popular libraries would be excellent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6389) java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
[ https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-6389: -- Priority: Major (was: Blocker) > java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR] > > > Key: ARROW-6389 > URL: https://issues.apache.org/jira/browse/ARROW-6389 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.14.1 > Environment: Hadoop 2.85 > EMR 5.24.1 > python version: 3.7.4 > skein version: 0.8.0 >Reporter: Ben Schreck >Priority: Major > > I can't access hdfs through pyarrow ( from inside a yarn container created by > skein) > This code works in a jupyter notebook running on the master node, or in an > ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var: > ```{{import pyarrow; pyarrow.hdfs.connect()```}} > > However, when running on yarn by submitting the following skein application, > I get a Java error. > > {{name: test_conn > queue: default > master: > env: > ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native > JAVA_HOME: /etc/alternatives/jre > resources: > vcores: 1 > memory: 10 GiB > files: > conda_env: /home/hadoop/environment.tar.gz > script: | > echo $HADOOP_HOME > echo $JAVA_HOME > echo $HADOOP_CLASSPATH > echo $ARROW_LIBHDFS_DIR > source conda_env/bin/activate > python -c "import pyarrow; pyarrow.hdfs.connect(); > print(fs.open('test.txt').read())" > echo "Hello World!"}} > FYI I tried with/without all those extra env vars, to no effect. 
I also tried > modifying the EMR cluster with any of the following > > {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs" > "fs.AbstractFileSystem.hdfs.impl": > "org.apache.hadoop.hdfs.DistributedFileSystem" > "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}} > The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error- > it was able to find which class by name to use for the "hdfs://" prefix, > namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but not able to find > that class. > Logs: > > {{= > LogType:application.driver.log > Log Upload Time:Thu Aug 29 20:51:59 + 2019 > LogLength:2635 > Log Contents: > /usr/lib/hadoop > /usr/lib/jvm/java-openjdk > :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/* > hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=(NULL)) error: > java.io.IOException: No FileSystem for scheme: hdfs > at > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896) > at 
> org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884) > at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439) > at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414) > at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411) > Traceback (most recent call last): > File "", line 1, in > File > "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File > "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", >
[jira] [Updated] (ARROW-6389) [Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
[ https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-6389: -- Summary: [Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR] (was: java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]) > [Python] java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR] > - > > Key: ARROW-6389 > URL: https://issues.apache.org/jira/browse/ARROW-6389 > Project: Apache Arrow > Issue Type: Bug > Components: Java, Python >Affects Versions: 0.14.1 > Environment: Hadoop 2.85 > EMR 5.24.1 > python version: 3.7.4 > skein version: 0.8.0 >Reporter: Ben Schreck >Priority: Major > > I can't access hdfs through pyarrow ( from inside a yarn container created by > skein) > This code works in a jupyter notebook running on the master node, or in an > ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var: > ```{{import pyarrow; pyarrow.hdfs.connect()```}} > > However, when running on yarn by submitting the following skein application, > I get a Java error. > > {{name: test_conn > queue: default > master: > env: > ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native > JAVA_HOME: /etc/alternatives/jre > resources: > vcores: 1 > memory: 10 GiB > files: > conda_env: /home/hadoop/environment.tar.gz > script: | > echo $HADOOP_HOME > echo $JAVA_HOME > echo $HADOOP_CLASSPATH > echo $ARROW_LIBHDFS_DIR > source conda_env/bin/activate > python -c "import pyarrow; pyarrow.hdfs.connect(); > print(fs.open('test.txt').read())" > echo "Hello World!"}} > FYI I tried with/without all those extra env vars, to no effect. 
I also tried > modifying the EMR cluster with any of the following > > {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs" > "fs.AbstractFileSystem.hdfs.impl": > "org.apache.hadoop.hdfs.DistributedFileSystem" > "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}} > The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error- > it was able to find which class by name to use for the "hdfs://" prefix, > namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but not able to find > that class. > Logs: > > {{= > LogType:application.driver.log > Log Upload Time:Thu Aug 29 20:51:59 + 2019 > LogLength:2635 > Log Contents: > /usr/lib/hadoop > /usr/lib/jvm/java-openjdk > :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/* > hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=(NULL)) error: > java.io.IOException: No FileSystem for scheme: hdfs > at > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896) > at 
> org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884) > at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439) > at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414) > at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411) > Traceback (most recent call last): > File "", line 1, in > File > "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_01/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", > line 215, in connect > extra_conf=extra_conf) > File >
[jira] [Commented] (ARROW-6361) [Java] sbt docker publish fails due to Arrow dependecies
[ https://issues.apache.org/jira/browse/ARROW-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974175#comment-16974175 ] Antoine Pitrou commented on ARROW-6361: --- [~emkornfield] > [Java] sbt docker publish fails due to Arrow dependecies > > > Key: ARROW-6361 > URL: https://issues.apache.org/jira/browse/ARROW-6361 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 0.14.1 >Reporter: Boris V.Kuznetsov >Priority: Major > Attachments: tree.txt > > > Hello guys > I'm using Arrow in my Scala project and included Maven deps in sbt as > required. > However, when I try to publish a Docker container with sbt 'docker:publish', > I get the following error: > [error] 1 error was encountered during merge > [error] java.lang.RuntimeException: deduplicate: different file contents > found in the following: > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-vector/0.14.1/arrow-vector-0.14.1.jar:git.properties > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-format/0.14.1/arrow-format-0.14.1.jar:git.properties > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-memory/0.14.1/arrow-memory-0.14.1.jar:git.properties > My project is [here|https://github.com/Clover-Group/tsp/tree/kafka]. > You may check project dependency tree attached. > How do I fix that? > Thank you -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6361) [Java] sbt docker publish fails due to Arrow dependecies
[ https://issues.apache.org/jira/browse/ARROW-6361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-6361: -- Summary: [Java] sbt docker publish fails due to Arrow dependecies (was: sbt docker publish fails due to Arrow dependecies) > [Java] sbt docker publish fails due to Arrow dependecies > > > Key: ARROW-6361 > URL: https://issues.apache.org/jira/browse/ARROW-6361 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 0.14.1 >Reporter: Boris V.Kuznetsov >Priority: Major > Attachments: tree.txt > > > Hello guys > I'm using Arrow in my Scala project and included Maven deps in sbt as > required. > However, when I try to publish a Docker container with sbt 'docker:publish', > I get the following error: > [error] 1 error was encountered during merge > [error] java.lang.RuntimeException: deduplicate: different file contents > found in the following: > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-vector/0.14.1/arrow-vector-0.14.1.jar:git.properties > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-format/0.14.1/arrow-format-0.14.1.jar:git.properties > [error] > /home/bku/.cache/coursier/v1/https/repo1.maven.org/maven2/org/apache/arrow/arrow-memory/0.14.1/arrow-memory-0.14.1.jar:git.properties > My project is [here|https://github.com/Clover-Group/tsp/tree/kafka]. > You may check project dependency tree attached. > How do I fix that? > Thank you -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7158) [C++][Visual Studio]Build config Error on non English Version visual studio.
[ https://issues.apache.org/jira/browse/ARROW-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974174#comment-16974174 ] Antoine Pitrou commented on ARROW-7158: --- cc [~kou] > [C++][Visual Studio]Build config Error on non English Version visual studio. > > > Key: ARROW-7158 > URL: https://issues.apache.org/jira/browse/ARROW-7158 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.15.1 >Reporter: Yiun Seungryong >Priority: Minor > > * Build Config Error on Non English OS > * always show > {code:java} > Not supported MSVC compiler {code} > * > [https://github.com/apache/arrow/blob/master/cpp/cmake_modules/CompilerInfo.cmake#L44] > There is a bug in the code below. > {code:java} > if(MSVC) > set(COMPILER_FAMILY "msvc") > if("${COMPILER_VERSION_FULL}" MATCHES > ".*Microsoft ?\\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64"){code} > * In my compiler the version display contains Korean. > {code:java} > Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64){code} > * Regular expression seems to need to be changed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
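The locale sensitivity reported above is easy to demonstrate outside CMake (Python regex used purely for illustration; the "fixed" pattern below is a hypothetical sketch, not the actual patch): anchoring only on the locale-independent parts of the banner (the vendor string and the version number) matches both the English and the Korean output.

```python
import re

english = "Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64"
korean = "Microsoft (R) C/C++ 최적화 컴파일러 버전 19.00.24215.1(x64)"

# The pattern from CompilerInfo.cmake: only matches the English banner.
old = r".*Microsoft ?\(R\) C/C\+\+ Optimizing Compiler Version 19.*x64"
assert re.match(old, english)
assert not re.match(old, korean)  # localized words break the match

# Hypothetical fix: keep only the locale-independent pieces.
new = r".*Microsoft ?\(R\) C/C\+\+.* 19\.\d+\.\d+.*x64"
assert re.match(new, english)
assert re.match(new, korean)
```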
[jira] [Closed] (ARROW-7154) [C++] Build error when building tests but not with snappy
[ https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-7154. - Resolution: Cannot Reproduce > [C++] Build error when building tests but not with snappy > - > > Key: ARROW-7154 > URL: https://issues.apache.org/jira/browse/ARROW-7154 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > > Since the docker-compose PR landed, I am having build errors like: > {code:java} > [361/376] Linking CXX executable debug/arrow-python-test > FAILED: debug/arrow-python-test > : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache > /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ > -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 > -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong > -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0 > -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror > -msse4.2 -g -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro > -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -rdynamic > src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o -o > debug/arrow-python-test > -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib > debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 > debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 > /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread > -lpthread -ldl -lutil -lrt -ldl > /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a > /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so > jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt > /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && : > 
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, > not found (try using -rpath or -rpath-link) > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not > found (try using -rpath or -rpath-link) > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > debug/libarrow.so.100.0.0: undefined reference to > `boost::system::detail::generic_category_ncx()' > /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: > debug/libarrow.so.100.0.0: undefined reference to > `boost::filesystem::path::operator/=(boost::filesystem::path const&)' > collect2: error: ld returned 1 exit status > {code} > which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed > by debug/libarrow.so.100.0.0, not found" (although this is certainly present). > The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set > to OFF, it works fine. > It also seems to be related to this specific change in the docker compose PR: > {code:java} > diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt > index c80ac3310..3b3c9eb8f 100644 > --- a/cpp/CMakeLists.txt > +++ b/cpp/CMakeLists.txt > @@ -266,6 +266,15 @@ endif(UNIX) > # Set up various options > # > -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) > - # Currently the compression tests require at least these libraries; bz2 and > - # zstd are optional. 
See ARROW-3984
> - set(ARROW_WITH_BROTLI ON)
> - set(ARROW_WITH_LZ4 ON)
> - set(ARROW_WITH_SNAPPY ON)
> - set(ARROW_WITH_ZLIB ON)
> -endif()
> -
> if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
>   set(ARROW_JSON ON)
> endif()
> {code}
> If I add that block back, the build works.
> * With only `set(ARROW_WITH_BROTLI ON)`, it still fails.
> * With only `set(ARROW_WITH_LZ4 ON)`, it also fails, but with an error about liblz4 instead of libboost (even though liblz4 is actually present).
> * With only `set(ARROW_WITH_SNAPPY ON)`, it works.
> * With only `set(ARROW_WITH_ZLIB ON)`, it also fails, but with an error about libz.so.1 not being found.
> * With both `set(ARROW_WITH_SNAPPY ON)` and `set(ARROW_WITH_ZLIB ON)`, it also works.
> So it seems that the absence of snappy causes the other libraries to fail.
> In the recommended build settings in the development docs ([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]), the compression libraries are enabled. But I was still building without them (stemming from the time they were enabled by
[jira] [Commented] (ARROW-7154) [C++] Build error when building tests but not with snappy
[ https://issues.apache.org/jira/browse/ARROW-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974173#comment-16974173 ] Antoine Pitrou commented on ARROW-7154:
---
Ok, closing.
> [C++] Build error when building tests but not with snappy
> 
> Key: ARROW-7154
> URL: https://issues.apache.org/jira/browse/ARROW-7154
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
>
> (full issue description quoted in the close notification above)
[jira] [Updated] (ARROW-7162) [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake
[ https://issues.apache.org/jira/browse/ARROW-7162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7162: -- Labels: pull-request-available (was: ) > [C++] Cleanup warnings in cmake_modules/SetupCxxFlags.cmake > --- > > Key: ARROW-7162 > URL: https://issues.apache.org/jira/browse/ARROW-7162 > Project: Apache Arrow > Issue Type: Wish > Components: C++, Developer Tools >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > For clang we currently disable a lot of warnings explicitly. This dates back > to when we enabled {{-Weverything}}. We should probably remove most or all of > these flags now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7163) [Doc] Fix double-and typos
[ https://issues.apache.org/jira/browse/ARROW-7163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974169#comment-16974169 ] Brian Wignall commented on ARROW-7163: -- Thank you for creating the Jira issue for me, Neal. > [Doc] Fix double-and typos > -- > > Key: ARROW-7163 > URL: https://issues.apache.org/jira/browse/ARROW-7163 > Project: Apache Arrow > Issue Type: Bug > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)