Re: Arrow Plasma Java Issues
Hi Brett,

You would need to set the Java library path (pointing to the location of the Plasma native library). Please see this script, which runs the Plasma tests, for an example of how to do this: https://github.com/apache/arrow/blob/master/java/plasma/test.sh

Thanks.

On Wed, Nov 14, 2018 at 2:47 AM Brett Kosciolek (RIT Student) <bpk9...@rit.edu> wrote:
> I'm trying to use the Java API for Apache Arrow to connect to a memory
> store. I've done this in Python, successfully, using the Python API by
> following the guide here.
>
> I've also looked at the C++ API documentation, but it didn't help much.
>
> The Java docs make it look similar to the other documentation:
>
> 1. Make sure the plasma object store is running (usually "/tmp/plasma" for the examples).
> 2. Create a client.
> 3. Connect the client by providing the object store ("/tmp/plasma"), and ("", 0) for the other two parameters.
>
> However, when attempting the following line, I get an UnsatisfiedLinkError
> that I can't find any reference to in the Apache Arrow documentation. Other
> solutions found on Google (such as calling System.load) haven't been
> successful either.
>
> PlasmaClient client = new PlasmaClient("/tmp/plasma", "", 0);
>
> A copy of my error messages can be seen below:
>
> Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.arrow.plasma.PlasmaClientJNI.connect(Ljava/lang/String;Ljava/lang/String;I)J
>     at org.apache.arrow.plasma.PlasmaClientJNI.connect(Native Method)
>     at org.apache.arrow.plasma.PlasmaClient.<init>(PlasmaClient.java:44)
>     at plas.main(plas.java:11)
>
> Any help is appreciated. Thank you!
[jira] [Created] (ARROW-3787) Implement From for BinaryArray
Paddy Horan created ARROW-3787:
----------------------------------

Summary: Implement From for BinaryArray
Key: ARROW-3787
URL: https://issues.apache.org/jira/browse/ARROW-3787
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-3786) Enable merge_arrow_pr.py script to run in non-English JIRA accounts.
Yosuke Shiro created ARROW-3786:
----------------------------------

Summary: Enable merge_arrow_pr.py script to run in non-English JIRA accounts.
Key: ARROW-3786
URL: https://issues.apache.org/jira/browse/ARROW-3786
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Yosuke Shiro

I read https://github.com/apache/arrow/tree/master/dev#arrow-developer-scripts and followed the instructions there:

{code:java}
dev/merge_arrow_pr.py{code}

I got the following result:

{code:java}
Would you like to update the associated JIRA? (y/n): y
Enter comma-separated fix version(s) [0.12.0]:
=== JIRA ARROW-3748 ===
summary		[GLib] Add GArrowCSVReader
assignee	Kouhei Sutou
status		オープン
url		https://issues.apache.org/jira/browse/ARROW-3748
list index out of range{code}

It looks like the error occurs at https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py#L181. My JIRA account language is Japanese; the script does not seem to work when the account language is not English:

{code:java}
print(self.jira_con.transitions(self.jira_id))
[{'id': '701', 'name': '課題のクローズ',
  'to': {'self': 'https://issues.apache.org/jira/rest/api/2/status/6',
         'description': '課題の検討が終了し、解決方法が正しいことを表します。クローズした課題は再オープンすることができます。',
         'iconUrl': 'https://issues.apache.org/jira/images/icons/statuses/closed.png',
         'name': 'クローズ', 'id': '6',
         'statusCategory': {'self': 'https://issues.apache.org/jira/rest/api/2/statuscategory/3',
                            'id': 3, 'key': 'done', 'colorName': 'green', 'name': '完了'}}},
 {'id': '3', 'name': '課題を再オープンする',
  'to': {'self': 'https://issues.apache.org/jira/rest/api/2/status/4',
         'description': '課題が一度解決されたが解決に間違いがあったと見なされたことを表します。ここから課題を割り当て済みにするか解決済みに設定できます。',
         'iconUrl': 'https://issues.apache.org/jira/images/icons/statuses/reopened.png',
         'name': '再オープン', 'id': '4',
         'statusCategory': {'self': 'https://issues.apache.org/jira/rest/api/2/statuscategory/2',
                            'id': 2, 'key': 'new', 'colorName': 'blue-gray', 'name': 'To Do'}}}]{code}
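One locale-independent way to pick a transition (a sketch of a possible approach, not the project's actual fix) is to match on the transition's `statusCategory` key, which stays `"done"`/`"new"` regardless of the account language, instead of matching the localized `name`:

```python
# Sketch: select a JIRA transition without depending on the account language.
# `transitions` mirrors the structure returned by jira_con.transitions(issue);
# the localized 'name' fields vary, but 'to.statusCategory.key' is stable.
def find_resolve_transition(transitions):
    for t in transitions:
        if t.get("to", {}).get("statusCategory", {}).get("key") == "done":
            return t["id"]
    raise ValueError("no 'done' transition found")

# Example using the Japanese output from the report above:
transitions = [
    {"id": "701", "name": "課題のクローズ",
     "to": {"statusCategory": {"key": "done", "name": "完了"}}},
    {"id": "3", "name": "課題を再オープンする",
     "to": {"statusCategory": {"key": "new", "name": "To Do"}}},
]
print(find_resolve_transition(transitions))  # -> 701
```

The function name and the trimmed dict shape are illustrative; merge_arrow_pr.py's real data has additional fields.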
[jira] [Created] (ARROW-3785) [C++] Use double-conversion conda package in CI toolchain
Wes McKinney created ARROW-3785:
----------------------------------

Summary: [C++] Use double-conversion conda package in CI toolchain
Key: ARROW-3785
URL: https://issues.apache.org/jira/browse/ARROW-3785
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

This is currently built from the ExternalProject (EP).
[jira] [Created] (ARROW-3784) [R] Array with type fails with x is not a vector
Javier Luraschi created ARROW-3784:
----------------------------------

Summary: [R] Array with type fails with x is not a vector
Key: ARROW-3784
URL: https://issues.apache.org/jira/browse/ARROW-3784
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Javier Luraschi

{code:java}
array(1:10, type = int32())
{code}

Actual:

{code:java}
Error: `x` is not a vector
{code}

Expected:

{code:java}
arrow::Array
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
{code}
[jira] [Created] (ARROW-3783) [R] Incorrect collection of float type
Javier Luraschi created ARROW-3783:
----------------------------------

Summary: [R] Incorrect collection of float type
Key: ARROW-3783
URL: https://issues.apache.org/jira/browse/ARROW-3783
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Javier Luraschi

Repro from `sparklyr`:

{code:java}
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")
DBI::dbGetQuery(sc, "SELECT cast(1 as float)")
{code}

Actual:

{code:java}
  CAST(1 AS FLOAT)
1       1065353216
{code}

Expected:

{code:java}
  CAST(1 AS FLOAT)
1                1
{code}
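For context (an editorial illustration, not part of the report): 1065353216 is exactly the IEEE-754 single-precision bit pattern of 1.0, which suggests the float's raw bits are being collected as a 32-bit integer:

```python
import struct

# Reinterpret the 32-bit integer 1065353216 as an IEEE-754 float.
bits = 1065353216                                    # 0x3F800000
value = struct.unpack("<f", struct.pack("<i", bits))[0]
print(value)  # -> 1.0
```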
Arrow Plasma Java Issues
I'm trying to use the Java API for Apache Arrow to connect to a memory store. I've done this in Python, successfully, using the Python API by following the guide here.

I've also looked at the C++ API documentation, but it didn't help much.

The Java docs make it look similar to the other documentation:

1. Make sure the plasma object store is running (usually "/tmp/plasma" for the examples).
2. Create a client.
3. Connect the client by providing the object store ("/tmp/plasma"), and ("", 0) for the other two parameters.

However, when attempting the following line, I get an UnsatisfiedLinkError that I can't find any reference to in the Apache Arrow documentation. Other solutions found on Google (such as calling System.load) haven't been successful either.

PlasmaClient client = new PlasmaClient("/tmp/plasma", "", 0);

A copy of my error messages can be seen below:

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.arrow.plasma.PlasmaClientJNI.connect(Ljava/lang/String;Ljava/lang/String;I)J
    at org.apache.arrow.plasma.PlasmaClientJNI.connect(Native Method)
    at org.apache.arrow.plasma.PlasmaClient.<init>(PlasmaClient.java:44)
    at plas.main(plas.java:11)

Any help is appreciated. Thank you!
[jira] [Created] (ARROW-3782) [C++] Implement BufferedReader for C++
Wes McKinney created ARROW-3782:
----------------------------------

Summary: [C++] Implement BufferedReader for C++
Key: ARROW-3782
URL: https://issues.apache.org/jira/browse/ARROW-3782
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

This will be the reader companion to {{arrow::io::BufferedOutputStream}} and a C++-like version of the {{io.BufferedReader}} class in the Python standard library: https://docs.python.org/3/library/io.html#io.BufferedReader

We already have a partial version of this that is used in the Parquet library: https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/memory.h#L413

In particular we need:

* Seek implemented for random access (it will invalidate the buffer)
* a Peek method returning {{shared_ptr}}, a zero-copy view into the buffered memory

This is needed for ARROW-3126.
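As a point of reference, Python's {{io.BufferedReader}}, which the issue cites as the model, already exhibits the two required behaviors: Peek returns buffered bytes without advancing the stream position, and Seek discards the buffer:

```python
import io

raw = io.BytesIO(b"hello world")
reader = io.BufferedReader(raw, buffer_size=4096)

peeked = reader.peek(5)   # returns buffered bytes without consuming them
data = reader.read(5)     # a subsequent read still starts at position 0
print(data)               # -> b'hello'

reader.seek(6)            # random-access seek invalidates the buffer
rest = reader.read()
print(rest)               # -> b'world'
```

Note that `peek(n)` may return more than `n` bytes (up to the full buffer contents); it only guarantees not to advance the position.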
[jira] [Created] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
Wes McKinney created ARROW-3781:
----------------------------------

Summary: [C++] Configure buffer size in arrow::io::BufferedOutputStream
Key: ARROW-3781
URL: https://issues.apache.org/jira/browse/ARROW-3781
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

This is hard-coded to 4096 right now. For higher-latency filesystems it may be desirable to use a larger buffer.

See also ARROW-3777 about performance testing against high-latency filesystems.
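Python's standard library exposes the same knob on its buffered streams; a sketch of the analogous behavior (an illustration of the API shape, not the proposed C++ signature):

```python
import io

# A larger buffer amortizes writes to a high-latency sink: small writes
# accumulate in memory and hit the underlying stream only on flush/close.
raw = io.BytesIO()
writer = io.BufferedWriter(raw, buffer_size=64 * 1024)  # CPython's default is 8 KiB

writer.write(b"x" * 100)
print(len(raw.getvalue()))  # -> 0   (still buffered, nothing written through)
writer.flush()
print(len(raw.getvalue()))  # -> 100 (flushed to the underlying stream)
```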
[jira] [Created] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16
Javier Luraschi created ARROW-3780:
----------------------------------

Summary: [R] Failed to fetch data: invalid data when collecting int16
Key: ARROW-3780
URL: https://issues.apache.org/jira/browse/ARROW-3780
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Javier Luraschi

Repro from a sparklyr unit test:

{code:java}
library(dplyr)
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")

hive_type <- tibble::frame_data(
  ~stype,     ~svalue, ~rtype,    ~rvalue, ~arrow,
  "smallint", "1",     "integer", "1",     "integer",
)

spark_query <- hive_type %>%
  mutate(
    query = paste0("cast(", svalue, " as ", stype, ") as ",
                   gsub("\\(|\\)", "", stype), "_col")
  ) %>%
  pull(query) %>%
  paste(collapse = ", ") %>%
  paste("SELECT", .)

spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
  lapply(function(e) class(e)[[1]]) %>%
  as.character()
{code}

Actual: error: Failed to fetch data: invalid data
[jira] [Created] (ARROW-3779) [Format] Standardize timezone specification
Krisztian Szucs created ARROW-3779:
----------------------------------

Summary: [Format] Standardize timezone specification
Key: ARROW-3779
URL: https://issues.apache.org/jira/browse/ARROW-3779
Project: Apache Arrow
Issue Type: Improvement
Reporter: Krisztian Szucs
[jira] [Created] (ARROW-3778) [C++] Don't put implementations in test-util.h
Antoine Pitrou created ARROW-3778:
----------------------------------

Summary: [C++] Don't put implementations in test-util.h
Key: ARROW-3778
URL: https://issues.apache.org/jira/browse/ARROW-3778
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 0.11.1
Reporter: Antoine Pitrou

{{test-util.h}} is included in most (all?) test files, and it is slow to compile because it pulls in many other headers and its helper functions are recompiled in every translation unit. Instead, we should keep only declarations in {{test-util.h}} and move the implementations to a separate {{.cc}} file.
Re: Support for TIMESTAMP_NANOS in parquet-cpp
hi Roman,

I agree with you that it is not a small change, because of the new union-based logical type representation and the compatibility handling for old Parquet files (as well as an option to write "old" metadata for compatibility with old Parquet readers).

- Wes

On Tue, Nov 13, 2018 at 10:13 AM Roman Karlstetter wrote:
>
> Hi,
>
> that sounds like the task might not be ideally suited for someone new to
> the implementations of both Arrow and Parquet, especially since all of
> those compatibility issues need to be handled correctly.
> I don't think it makes sense for me to continue with this implementation
> unless there is some further specification of how it should be implemented.
>
> Roman
>
> From: Wes McKinney
> Sent: Monday, November 12, 2018 16:50
> To: dev@arrow.apache.org
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> hi Roman,
>
> For nanosecond Arrow timestamps, the relevant code path for this is here:
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607
>
> You'll also have to modify some code in parquet/types.*,
> parquet/schema.*, and parquet/arrow/schema.cc to handle the additional
> metadata. If you aren't dealing with Arrow at all, then it should be
> sufficient to modify just the handling of the logical type metadata
> in parquet/types.*.
>
> There is a significant complication that I didn't think about yet:
> we aren't handling the new logical types union in parquet-cpp yet, so
> there's quite a lot of work beyond just dealing with the nanosecond
> metadata. I am also not sure what the implications are for backwards
> compatibility, and I haven't had time to look in detail at what needs to
> be done since the new metadata structure was added to the Thrift
> definition.
>
> - Wes
>
> On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter wrote:
> >
> > I've had the chance to look into this.
> > There is one issue that came up which I don't know how to handle.
> > Previously, int96 seems to have been used for nanosecond precision, but
> > this is somewhat deprecated, as far as I understand it.
> > So, how should we handle nanoseconds and int96 vs. int64 in a) reading
> > from and b) writing to Parquet?
> > There seem to be some writer settings, all related to timestamp precision
> > properties. Is there any advice you can give me in that regard?
> >
> > Thanks,
> > Roman
> >
> > From: Roman Karlstetter
> > Sent: Friday, November 9, 2018 08:38
> > To: dev@arrow.apache.org
> > Subject: AW: Support for TIMESTAMP_NANOS in parquet-cpp
> >
> > I would be willing to implement that. I'll probably need some advice on
> > my patch, though, as I'm fairly new to the Parquet code.
> >
> > Roman
> >
> > From: Wes McKinney
> > Sent: Thursday, November 8, 2018 23:22
> > To: dev@arrow.apache.org
> > Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
> >
> > I opened an issue here:
> > https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> > welcome.
> >
> > On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote:
> > >
> > > hi Roman,
> > >
> > > We would welcome adding such a document to the Arrow wiki
> > > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > > questions, it really depends on whether there is a member of the
> > > Parquet community who will do the work. Patches that implement any
> > > released functionality in the Parquet format specification are
> > > welcome.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> > > >
> > > > Hi everyone,
> > > > in parquet-format, there is now support for TIMESTAMP_NANOS:
> > > > https://github.com/apache/parquet-format/pull/102
> > > > For parquet-cpp, this is not yet supported. I have a few questions now:
> > > > • is there an overview of which release of parquet-format is currently
> > > > fully supported in parquet-cpp (something like a feature support matrix)?
> > > > • how fast are new features in parquet-format adopted?
> > > > I think having a document describing the current completeness of the
> > > > implementation of the spec would be very helpful for users of the
> > > > parquet-cpp library.
> > > > Thanks,
> > > > Roman
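For background on the int96 question quoted above (an editorial illustration, not parquet-cpp code): the deprecated INT96 timestamp packs nanoseconds-within-day into its first 8 bytes and a Julian day number into its last 4, so a converter between int64 nanoseconds since the Unix epoch and INT96 looks roughly like:

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588      # 1970-01-01 as a Julian day number
NANOS_PER_DAY = 86_400 * 1_000_000_000

def nanos_to_int96(ns_since_epoch):
    """Pack int64 nanoseconds since the Unix epoch into the 12-byte INT96 layout."""
    days, nanos_in_day = divmod(ns_since_epoch, NANOS_PER_DAY)
    return struct.pack("<qi", nanos_in_day, JULIAN_DAY_OF_UNIX_EPOCH + days)

def int96_to_nanos(raw12):
    """Inverse: unpack the 12-byte INT96 layout back into int64 nanoseconds."""
    nanos_in_day, julian_day = struct.unpack("<qi", raw12)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_in_day

ns = 1_542_000_000_123_456_789          # an instant in November 2018
assert int96_to_nanos(nanos_to_int96(ns)) == ns
```

The function names are hypothetical; the layout (little-endian nanos-of-day followed by Julian day) is the Impala-style INT96 encoding that int64 nanosecond timestamps would replace.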
Re: Help organize Parquet-related C++, Python issues
I just spent some time combing through Arrow JIRA issues that mention "parquet". We now have 60 Python-related issues appropriately labeled:
https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Development

I noted that some reported bugs are duplicates of each other, but I will need to examine them more closely to confirm. There are another 17 that are more C++-related:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95655294

On Mon, Nov 12, 2018 at 1:16 PM Wes McKinney wrote:
>
> hi folks,
>
> As some of you may have noticed, we are accumulating a mountain of
> Parquet-related JIRA issues, many of them resulting from people using
> Apache Arrow to do data engineering in Python and running into
> problems.
>
> To help with having better visibility into all the relevant Parquet
> issues, and with the monorepo merge behind us, I created a couple of wiki
> pages linked to from the main
> https://cwiki.apache.org/confluence/display/ARROW page:
>
> * C++ issue dashboard: https://cwiki.apache.org/confluence/x/fpWzBQ
> * Python issue dashboard:
> https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Development
>
> Many Parquet issues in the ARROW project are not found in these
> dashboards because they lack the "parquet" label. Please help with
> project organization by remembering to apply the "parquet" label to
> any such issue.
>
> Since Ruby also supports Parquet now via GLib, and R support for
> Parquet is coming soon, we need to do what we can to grow the
> community of people working on the core Parquet libraries and the
> things they depend on, like the IO and memory management subsystems of
> the Arrow C++ libraries.
>
> In general, I think it is very important for us to have fast and
> reliable C++ support (and language bindings) for the 5 major file
> formats in use in data warehousing:
>
> * CSV
> * JSON
> * Parquet
> * Avro
> * ORC
>
> Antoine has been leading efforts on reading CSV files, and we will
> need to make a push into JSON and Avro at some point.
>
> Thanks
> Wes
AW: Support for TIMESTAMP_NANOS in parquet-cpp
Hi,

that sounds like the task might not be ideally suited for someone new to the implementations of both Arrow and Parquet, especially since all of those compatibility issues need to be handled correctly. I don't think it makes sense for me to continue with this implementation unless there is some further specification of how it should be implemented.

Roman

From: Wes McKinney
Sent: Monday, November 12, 2018 16:50
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

hi Roman,

For nanosecond Arrow timestamps, the relevant code path for this is here:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607

You'll also have to modify some code in parquet/types.*, parquet/schema.*, and parquet/arrow/schema.cc to handle the additional metadata. If you aren't dealing with Arrow at all, then it should be sufficient to modify just the handling of the logical type metadata in parquet/types.*.

There is a significant complication that I didn't think about yet: we aren't handling the new logical types union in parquet-cpp yet, so there's quite a lot of work beyond just dealing with the nanosecond metadata. I am also not sure what the implications are for backwards compatibility, and I haven't had time to look in detail at what needs to be done since the new metadata structure was added to the Thrift definition.

- Wes

On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter wrote:
>
> I've had the chance to look into this.
> There is one issue that came up which I don't know how to handle.
> Previously, int96 seems to have been used for nanosecond precision, but
> this is somewhat deprecated, as far as I understand it.
> So, how should we handle nanoseconds and int96 vs. int64 in a) reading
> from and b) writing to Parquet?
> There seem to be some writer settings, all related to timestamp precision
> properties. Is there any advice you can give me in that regard?
>
> Thanks,
> Roman
>
> From: Roman Karlstetter
> Sent: Friday, November 9, 2018 08:38
> To: dev@arrow.apache.org
> Subject: AW: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I would be willing to implement that. I'll probably need some advice on
> my patch, though, as I'm fairly new to the Parquet code.
>
> Roman
>
> From: Wes McKinney
> Sent: Thursday, November 8, 2018 23:22
> To: dev@arrow.apache.org
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I opened an issue here:
> https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> welcome.
>
> On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote:
> >
> > hi Roman,
> >
> > We would welcome adding such a document to the Arrow wiki
> > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > questions, it really depends on whether there is a member of the
> > Parquet community who will do the work. Patches that implement any
> > released functionality in the Parquet format specification are
> > welcome.
> >
> > Thanks
> > Wes
> >
> > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> > >
> > > Hi everyone,
> > > in parquet-format, there is now support for TIMESTAMP_NANOS:
> > > https://github.com/apache/parquet-format/pull/102
> > > For parquet-cpp, this is not yet supported. I have a few questions now:
> > > • is there an overview of which release of parquet-format is currently
> > > fully supported in parquet-cpp (something like a feature support matrix)?
> > > • how fast are new features in parquet-format adopted?
> > > I think having a document describing the current completeness of the
> > > implementation of the spec would be very helpful for users of the
> > > parquet-cpp library.
> > > Thanks,
> > > Roman
[jira] [Created] (ARROW-3777) [Python] Implement a mock "high latency" filesystem
Wes McKinney created ARROW-3777:
----------------------------------

Summary: [Python] Implement a mock "high latency" filesystem
Key: ARROW-3777
URL: https://issues.apache.org/jira/browse/ARROW-3777
Project: Apache Arrow
Issue Type: New Feature
Components: Python
Reporter: Wes McKinney

Some of our tools don't perform well out of the box on filesystems with high-latency reads, like cloud blob stores. In such cases it may be better to use buffered reads with a larger read-ahead window. Having a mock filesystem that introduces latency into reads will help with testing and developing APIs for this.
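A minimal sketch of what such a mock might look like (hypothetical, not pyarrow's API): a file-like wrapper that sleeps before every read to simulate one high-latency round trip.

```python
import io
import time

class HighLatencyFile:
    """Wraps a file-like object and injects a fixed delay before each read."""

    def __init__(self, raw, latency_seconds):
        self._raw = raw
        self._latency = latency_seconds

    def read(self, size=-1):
        time.sleep(self._latency)   # simulate one high-latency round trip
        return self._raw.read(size)

    def seek(self, pos, whence=0):
        return self._raw.seek(pos, whence)

# Many small reads pay the latency repeatedly, which is exactly what a
# buffered reader with a larger read-ahead window is meant to amortize.
f = HighLatencyFile(io.BytesIO(b"a" * 1024), latency_seconds=0.001)
chunks = [f.read(64) for _ in range(16)]
print(sum(len(c) for c in chunks))  # -> 1024
```

The class name and constructor parameters are illustrative; a real mock would also cover write, readahead, and the rest of the filesystem interface.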