Re: Arrow Plasma Java Issues

2018-11-13 Thread Praveen Kumar
Hi Brett,

You need to set java.library.path to point at the directory containing the
plasma native library, e.g. by passing -Djava.library.path=<dir> to the JVM.

Please see the script that runs the plasma tests for an example of how to do
this (https://github.com/apache/arrow/blob/master/java/plasma/test.sh).

Thx.

On Wed, Nov 14, 2018 at 2:47 AM Brett Kosciolek (RIT Student) <
bpk9...@rit.edu> wrote:

> I'm trying to use the Java API for Apache Arrow to connect to a memory
> store. I've done this in Python, successfully, using the Python API by
> following the guide here.
>
> I've also looked at the C++ API documentation, but it didn't help much.
>
> The Java docs make it look similar to the other documentation.
>
> Make sure the plasma object store is running (usually "/tmp/plasma" for
> the examples).
> Create client
> Connect to the client by providing the object store
> ("/tmp/plasma"), and ("", 0) for the other two parameters.
>
> However, when attempting to use the following line, I get an
> UnsatisfiedLinkError that I can't find any reference to in the Apache
> Arrow documentation. Other solutions found on Google (such as calling
> System.load) haven't been successful either.
>
> PlasmaClient client = new PlasmaClient("/tmp/plasma", "", 0);
>
> A copy of my error messages can be seen below:
>
> Exception in thread "main"
>
> java.lang.UnsatisfiedLinkError: org.apache.arrow.plasma.PlasmaClientJNI.connect(Ljava/lang/String;Ljava/lang/String;I)J
> at org.apache.arrow.plasma.PlasmaClientJNI.connect(Native Method) at
> org.apache.arrow.plasma.PlasmaClient.<init>(PlasmaClient.java:44) at
> plas.main(plas.java:11)
>
> Any help is appreciated. Thank you!
>


[jira] [Created] (ARROW-3787) Implement From for BinaryArray

2018-11-13 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-3787:
--

 Summary: Implement From for BinaryArray
 Key: ARROW-3787
 URL: https://issues.apache.org/jira/browse/ARROW-3787
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3786) Enable merge_arrow_pr.py script to run in non-English JIRA accounts.

2018-11-13 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-3786:
---

 Summary: Enable merge_arrow_pr.py script to run in non-English 
JIRA accounts.
 Key: ARROW-3786
 URL: https://issues.apache.org/jira/browse/ARROW-3786
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Yosuke Shiro


I read [https://github.com/apache/arrow/tree/master/dev#arrow-developer-scripts]
 
I ran the following command.
{code:java}
dev/merge_arrow_pr.py{code}
I got the following result.
{code:java}
Would you like to update the associated JIRA? (y/n): y
Enter comma-separated fix version(s) [0.12.0]:
=== JIRA ARROW-3748 ===
summary [GLib] Add GArrowCSVReader
assigneeKouhei Sutou
status  オープン
url https://issues.apache.org/jira/browse/ARROW-3748
 
list index out of range{code}
 
The error appears to occur at
[https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py#L181] .
My JIRA account language is Japanese, and the script does not seem to work
when the account language is not English.
{code:java}
print(self.jira_con.transitions(self.jira_id))

[{'id': '701', 'name': '課題のクローズ', 'to': {'self':
'https://issues.apache.org/jira/rest/api/2/status/6', 'description':
'課題の検討が終了し、解決方法が正しいことを表します。クローズした課題は再オープンすることができます。', 'iconUrl':
'https://issues.apache.org/jira/images/icons/statuses/closed.png', 'name':
'クローズ', 'id': '6', 'statusCategory': {'self':
'https://issues.apache.org/jira/rest/api/2/statuscategory/3', 'id': 3, 'key':
'done', 'colorName': 'green', 'name': '完了'}}}, {'id': '3', 'name':
'課題を再オープンする', 'to': {'self':
'https://issues.apache.org/jira/rest/api/2/status/4', 'description':
'課題が一度解決されたが解決に間違いがあったと見なされたことを表します。ここから課題を割り当て済みにするか解決済みに設定できます。', 'iconUrl':
'https://issues.apache.org/jira/images/icons/statuses/reopened.png', 'name':
'再オープン', 'id': '4', 'statusCategory': {'self':
'https://issues.apache.org/jira/rest/api/2/statuscategory/2', 'id': 2, 'key':
'new', 'colorName': 'blue-gray', 'name': 'To Do'}}}]{code}
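A locale-independent fix (a sketch only; the helper name and selection key here are illustrative, not the actual patch) would be to pick the transition by its status-category key rather than by its localized name:

```python
# Hypothetical sketch: select a JIRA transition without relying on the
# localized 'name' field, using the transition data pasted above.
transitions = [
    {"id": "701", "name": "課題のクローズ",
     "to": {"statusCategory": {"key": "done"}}},
    {"id": "3", "name": "課題を再オープンする",
     "to": {"statusCategory": {"key": "new"}}},
]

def find_transition_id(transitions, target_key):
    """Return the id of the first transition whose target status
    category matches target_key ('done', 'new', ...)."""
    matches = [t for t in transitions
               if t["to"]["statusCategory"]["key"] == target_key]
    if not matches:
        raise ValueError("no transition to status category %r" % target_key)
    return matches[0]["id"]

assert find_transition_id(transitions, "done") == "701"  # works in any locale
```

The statusCategory key is part of JIRA's REST payload and is not translated, so matching on it avoids the language dependence entirely.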



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3785) [C++] Use double-conversion conda package in CI toolchain

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3785:
---

 Summary: [C++] Use double-conversion conda package in CI toolchain
 Key: ARROW-3785
 URL: https://issues.apache.org/jira/browse/ARROW-3785
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


This is currently being built from the CMake ExternalProject (EP)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3784) [R] Array with type fails with x is not a vector

2018-11-13 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3784:
--

 Summary: [R] Array with type fails with x is not a vector 
 Key: ARROW-3784
 URL: https://issues.apache.org/jira/browse/ARROW-3784
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


{code:java}
array(1:10, type = int32())
{code}
Actual:
{code:java}
 Error: `x` is not a vector 
{code}
Expected:
{code:java}
arrow::Array [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3783) [R] Incorrect collection of float type

2018-11-13 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3783:
--

 Summary: [R] Incorrect collection of float type
 Key: ARROW-3783
 URL: https://issues.apache.org/jira/browse/ARROW-3783
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


Repro from `sparklyr`:

 
{code:java}
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")
DBI::dbGetQuery(sc, "SELECT cast(1 as float)"){code}
 

Actual:
{code:java}
  CAST(1 AS FLOAT)
1   1065353216{code}
Expected:

 
{code:java}
  CAST(1 AS FLOAT)
11{code}
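As an observation on the report (not part of the original message): 1065353216 is exactly the IEEE-754 single-precision bit pattern of 1.0, which suggests the float's raw bits are being reinterpreted as a 32-bit integer somewhere in the collection path rather than converted by value. This is easy to check with the standard library:

```python
import struct

# Pack 1.0 as a little-endian float32, then reinterpret the same four
# bytes as a little-endian int32.
bits = struct.unpack("<i", struct.pack("<f", 1.0))[0]
assert bits == 1065353216  # the value the repro above actually returned
```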
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Arrow Plasma Java Issues

2018-11-13 Thread Brett Kosciolek (RIT Student)
I'm trying to use the Java API for Apache Arrow to connect to a memory
store. I've done this in Python, successfully, using the Python API by
following the guide here.

I've also looked at the C++ API documentation, but it didn't help much.

The Java docs make it look similar to the other documentation.

Make sure the plasma object store is running (usually "/tmp/plasma" for
the examples).
Create client
Connect to the client by providing the object store
("/tmp/plasma"), and ("", 0) for the other two parameters.

However, when attempting to use the following line, I get an
UnsatisfiedLinkError that I can't find any reference to in the Apache
Arrow documentation. Other solutions found on Google (such as calling
System.load) haven't been successful either.

PlasmaClient client = new PlasmaClient("/tmp/plasma", "", 0);

A copy of my error messages can be seen below:

Exception in thread "main"
java.lang.UnsatisfiedLinkError: org.apache.arrow.plasma.PlasmaClientJNI.connect(Ljava/lang/String;Ljava/lang/String;I)J
at org.apache.arrow.plasma.PlasmaClientJNI.connect(Native Method) at
org.apache.arrow.plasma.PlasmaClient.<init>(PlasmaClient.java:44) at
plas.main(plas.java:11)

Any help is appreciated. Thank you!


[jira] [Created] (ARROW-3782) [C++] Implement BufferedReader for C++

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3782:
---

 Summary: [C++] Implement BufferedReader for C++
 Key: ARROW-3782
 URL: https://issues.apache.org/jira/browse/ARROW-3782
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


This will be the reader companion to {{arrow::io::BufferedOutputStream}} and a 
C++-like version of the {{io.BufferedReader}} class in the Python standard 
library

https://docs.python.org/3/library/io.html#io.BufferedReader

We already have a partial version of this that's used in the Parquet library

https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/memory.h#L413

In particular we need:

* Seek implemented for random access (it will invalidate the buffer)
* A Peek method returning {{shared_ptr}}, a zero-copy view into the buffered 
memory

This is needed for ARROW-3126
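For reference, the Python class cited above already exhibits both requested behaviors; a minimal standard-library sketch (not Arrow code):

```python
import io

raw = io.BytesIO(b"hello world")
buf = io.BufferedReader(raw, buffer_size=8)

# Peek: inspect buffered bytes without consuming them (a cheap view,
# analogous to the zero-copy Peek requested for the C++ class).
assert buf.peek(5)[:5] == b"hello"
assert buf.read(5) == b"hello"

# Seek: random access works and simply discards the internal buffer.
buf.seek(6)
assert buf.read() == b"world"
```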



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3781:
---

 Summary: [C++] Configure buffer size in 
arrow::io::BufferedOutputStream
 Key: ARROW-3781
 URL: https://issues.apache.org/jira/browse/ARROW-3781
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


This is hard-coded to 4096 right now. For higher-latency file systems it may be 
desirable to use a larger buffer. See also ARROW-3777 about performance testing 
for high-latency files.
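Python's buffered writer exposes exactly this knob; a standard-library sketch of the requested behavior (not Arrow code):

```python
import io

sink = io.BytesIO()
# The requested knob: a buffer much larger than the 4096-byte default,
# as one might choose for a high-latency sink.
out = io.BufferedWriter(sink, buffer_size=64 * 1024)

out.write(b"x" * 1000)         # stays in the buffer...
assert sink.getvalue() == b""  # ...nothing written through yet
out.flush()
assert sink.getvalue() == b"x" * 1000
```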



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16

2018-11-13 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3780:
--

 Summary: [R] Failed to fetch data: invalid data when collecting 
int16
 Key: ARROW-3780
 URL: https://issues.apache.org/jira/browse/ARROW-3780
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


Repro from sparklyr unit test:
{code:java}
library(dplyr)
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")

hive_type <- tibble::frame_data(
 ~stype, ~svalue, ~rtype, ~rvalue, ~arrow,
 "smallint", "1", "integer", "1", "integer"
)

spark_query <- hive_type %>%
 mutate(
 query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", 
stype), "_col")
 ) %>%
 pull(query) %>%
 paste(collapse = ", ") %>%
 paste("SELECT", .)

spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
 lapply(function(e) class(e)[[1]]) %>%
 as.character(){code}
Actual: error: Failed to fetch data: invalid data 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3779) [Format] Standardize timezone specification

2018-11-13 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3779:
--

 Summary: [Format] Standardize timezone specification
 Key: ARROW-3779
 URL: https://issues.apache.org/jira/browse/ARROW-3779
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3778) [C++] Don't put implementations in test-util.h

2018-11-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3778:
-

 Summary: [C++] Don't put implementations in test-util.h
 Key: ARROW-3778
 URL: https://issues.apache.org/jira/browse/ARROW-3778
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.1
Reporter: Antoine Pitrou


{{test-util.h}} is included in most (all?) test files, and it is slow to 
compile because it includes many other files and recompiles helper functions 
every time. Instead we should have only declarations in {{test-util.h}} and 
put the implementations in a separate {{.cc}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-13 Thread Wes McKinney
hi Roman,

I agree with you that it is not a small change because of the new
union-based logical type representation, and compatibility for old
Parquet files (as well as an option to write "old" metadata for
compatibility with old Parquet readers).

- Wes
On Tue, Nov 13, 2018 at 10:13 AM Roman Karlstetter
 wrote:
>
> Hi,
>
> It sounds like the task might not be ideally suited for someone new to the
> implementations of both Arrow and Parquet, especially since all those
> compatibility issues need to be handled correctly.
> I don't think it makes sense for me to continue with this implementation
> unless there is some further specification of how it should be done.
>
> Roman
>
> Von: Wes McKinney
> Gesendet: Montag, 12. November 2018 16:50
> An: dev@arrow.apache.org
> Betreff: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> hi Roman,
>
> For nanosecond Arrow timestamps, the relevant code path for this is here:
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607
>
> You'll also have to modify some code in parquet/types.*,
> parquet/schema.*, parquet/arrow/schema.cc to handle the additional
> metadata. If you aren't dealing with Arrow at all, then it should be
> sufficient just to modify the handling of the logical types metadata
> in parquet/types.*.
>
> So there is a significant complication that I didn't think about yet:
> we aren't handling the new logical types union in parquet-cpp yet, so
> there's quite a lot of work beyond just dealing with the nanosecond
> metadata. I am also not sure what are the implications for backwards
> compatibility and haven't had time to look in detail at what needs to
> be done since the new metadata structure was added to the Thrift
> definition
>
> - Wes
> On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter
>  wrote:
> >
> > I've had the chance to look into this.
> > There is one issue that came up which I don't know how to handle. 
> > Previously, int96 seems to have been used for nanosecond precision, but 
> > this is somewhat deprecated, as far as I understand it.
> > So, how should we handle nanoseconds and int96 vs. int64 in a) reading from
> > and b) writing to Parquet?
> > There seem to be some writer settings, all related to timestamp precision
> > properties. Is there any advice any of you can give me in that regard?
> >
> > Thanks,
> > Roman
> >
> > Von: Roman Karlstetter
> > Gesendet: Freitag, 9. November 2018 08:38
> > An: dev@arrow.apache.org
> > Betreff: AW: Support for TIMESTAMP_NANOS in parquet-cpp
> >
> > I would be willing to implement that. I’ll probably need some advice on my 
> > patch though, as I’m fairly new to the parquet code.
> >
> > Roman
> >
> > Von: Wes McKinney
> > Gesendet: Donnerstag, 8. November 2018 23:22
> > An: dev@arrow.apache.org
> > Betreff: Re: Support for TIMESTAMP_NANOS in parquet-cpp
> >
> > I opened an issue here
> > https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> > welcome
> > On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney  wrote:
> > >
> > > hi Roman,
> > >
> > > We would welcome adding such a document to the Arrow wiki
> > > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > > questions, it really depends on whether there is a member of the
> > > Parquet community who will do the work. Patches that implement any
> > > released functionality in the Parquet format specification are
> > > welcome.
> > >
> > > Thanks
> > > Wes
> > > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter
> > >  wrote:
> > > >
> > > > Hi everyone,
> > > > in parquet-format, there is now support for TIMESTAMP_NANOS: 
> > > > https://github.com/apache/parquet-format/pull/102
> > > > For parquet-cpp, this is not yet supported. I have a few questions now:
> > > > • is there an overview of what release of parquet-format is currently 
> > > > fully supported in parquet-cpp (something like a feature support matrix)?
> > > > • how fast are new features in parquet-format adopted?
> > > > I think having a document describing the current completeness of 
> > > > implementation of the spec would be very helpful for users of the 
> > > > parquet-cpp library.
> > > > Thanks,
> > > > Roman
> > > >
> > > >
> >
> >
>
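For background on the int96 question above: Parquet's deprecated INT96 timestamps are conventionally laid out as an int64 of nanoseconds-within-day followed by an int32 Julian day number. The sketch below follows that commonly documented convention; it is illustrative and not code from this thread:

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number for 1970-01-01

def encode_int96_timestamp(unix_ns):
    """Pack a Unix timestamp in nanoseconds into the legacy INT96
    layout: int64 nanoseconds-within-day, then int32 Julian day."""
    day, ns_in_day = divmod(unix_ns, 86_400_000_000_000)
    return struct.pack("<qi", ns_in_day, JULIAN_DAY_OF_UNIX_EPOCH + day)

# Midnight of 1970-01-02 UTC is day 1, zero nanoseconds into the day.
assert struct.unpack("<qi", encode_int96_timestamp(86_400_000_000_000)) == (0, 2440589)
```

The int64 path, by contrast, stores the count directly with a logical-type annotation giving the unit, which is where the new union-based metadata comes in.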


Re: Help organize Parquet-related C++, Python issues

2018-11-13 Thread Wes McKinney
I just spent some time combing through Arrow JIRA issues that mention "parquet"

We now have 60 Python-related issues appropriately labeled

https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Development

I noted that some reported bugs are duplicates of each
other, but I will need to examine them more closely to confirm.

There are another 17 that are more C++-related:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95655294
On Mon, Nov 12, 2018 at 1:16 PM Wes McKinney  wrote:
>
> hi folks,
>
> As some of you may have noticed, we are accumulating a mountain of
> Parquet-related JIRA issues, many of them resulting from people using
> Apache Arrow to do data engineering in Python and running into
> problems.
>
> To help with having better visibility into all the relevant Parquet
> issues, and with the monorepo merge behind us, I created a couple wiki
> pages linked to from the main
> https://cwiki.apache.org/confluence/display/ARROW page:
>
> * C++ issue dashboard: https://cwiki.apache.org/confluence/x/fpWzBQ
> * Python issue dashboard:
> https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Development
>
> Many Parquet issues in the ARROW project are not found in these
> dashboards because they lack the "parquet" label. Please help with
> project organization by remembering to apply the "parquet" label to
> any issue.
>
> Since Ruby also supports Parquet now via GLib, and R support for
> Parquet is coming soon, we need to do what we can to grow the
> community of people working on the core Parquet libraries and the
> things they depend on, like the IO and memory management subsystems of
> the Arrow C++ libraries.
>
> In general, I think it is very important for us to have fast and
> reliable C++ support (and language bindings) for the 5 major file
> formats in use in data warehousing:
>
> * CSV
> * JSON
> * Parquet
> * Avro
> * ORC
>
> Antoine has been leading efforts on reading CSV files, and we will
> need to make a push into JSON and Avro at some point.
>
> Thanks
> Wes


AW: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-13 Thread Roman Karlstetter
Hi,

It sounds like the task might not be ideally suited for someone new to the 
implementations of both Arrow and Parquet, especially since all those 
compatibility issues need to be handled correctly.
I don't think it makes sense for me to continue with this implementation 
unless there is some further specification of how it should be done.

Roman




[jira] [Created] (ARROW-3777) [Python] Implement a mock "high latency" filesystem

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3777:
---

 Summary: [Python] Implement a mock "high latency" filesystem
 Key: ARROW-3777
 URL: https://issues.apache.org/jira/browse/ARROW-3777
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Wes McKinney


Some of our tools don't perform well out of the box on filesystems with high 
read latency, like cloud blob stores. In such cases, it may be better to use 
buffered reads with a larger read-ahead window. Having a mock filesystem that 
introduces latency into reads will help with testing and developing APIs for this.
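A minimal sketch of such a mock (the class and parameter names are illustrative, not an existing Arrow API): wrap any raw stream and sleep before each read call.

```python
import io
import time

class HighLatencyReader(io.RawIOBase):
    """Wrap a raw byte stream and inject a fixed delay per read call,
    simulating a high-latency filesystem such as a cloud blob store."""

    def __init__(self, raw, latency_s=0.05):
        self._raw = raw
        self._latency_s = latency_s

    def readable(self):
        return True

    def readinto(self, b):
        time.sleep(self._latency_s)      # simulated round trip
        data = self._raw.read(len(b))
        b[:len(data)] = data
        return len(data)

# A large read-ahead buffer amortizes the per-call latency:
slow = HighLatencyReader(io.BytesIO(b"x" * 1024), latency_s=0.001)
buffered = io.BufferedReader(slow, buffer_size=1024)
assert buffered.read(1024) == b"x" * 1024
```

Varying latency_s and buffer_size in a test harness like this makes the cost of small unbuffered reads directly measurable.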



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)