Re: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-08 Thread Roman Karlstetter
I would be willing to implement that. I’ll probably need some advice on my 
patch though, as I’m fairly new to the parquet code.

Roman

From: Wes McKinney
Sent: Thursday, November 8, 2018 23:22
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I opened an issue here
https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
welcome
On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney  wrote:
>
> hi Roman,
>
> We would welcome adding such a document to the Arrow wiki
> https://cwiki.apache.org/confluence/display/ARROW. As to your other
> questions, it really depends on whether there is a member of the
> Parquet community who will do the work. Patches that implement any
> released functionality in the Parquet format specification are
> welcome.
>
> Thanks
> Wes
> On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter
>  wrote:
> >
> > Hi everyone,
> > in parquet-format, there is now support for TIMESTAMP_NANOS: 
> > https://github.com/apache/parquet-format/pull/102
> > For parquet-cpp, this is not yet supported. I have a few questions now:
> > • is there an overview of what release of parquet-format is currently fully 
> > supported in parquet-cpp (something like a feature support matrix)?
> > • how fast are new features in parquet-format adopted?
> > I think having a document describing the current completeness of 
> > implementation of the spec would be very helpful for users of the 
> > parquet-cpp library.
> > Thanks,
> > Roman
> >
> >



[jira] [Created] (ARROW-3733) [GLib] Add to_string() to GArrowTable and GArrowColumn

2018-11-08 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3733:
---

 Summary: [GLib] Add to_string() to GArrowTable and GArrowColumn
 Key: ARROW-3733
 URL: https://issues.apache.org/jira/browse/ARROW-3733
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Assign/update : NA bitmap vs sentinel

2018-11-08 Thread Phillip Cloud
There is one database that I'm aware of that uses sentinels _and_ supports
complex types with missing values: Kx's KDB+. This has led to some
seriously strange choices like the ASCII space character being used as the
sentinel value for strings. See
https://code.kx.com/wiki/Reference/Datatypes for
more details.

On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney  wrote:

> hey Matt,
>
> Thanks for giving your perspective on the mailing list.
>
> My objective in writing about this recently
> (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
> need to update since the sentinel case can be done more efficiently
> than what's there now) was to help dispel the notion that using a
> separate value (bit or byte) to encode nullness is a performance
> compromise to comply with the requirements of database systems. I too
> prefer real world benchmarks to microbenchmarks, and probably null
> checking is not going to be the main driver of aggregate system
> performance. I had heard many people over the years object to bitmaps
> on performance grounds but without analysis to back it up.
>
> Some context for other readers on the mailing list: A language like R
> is not a database and has fewer built-in scalar types: int32, double,
> string (interned), and boolean. Out of these, int32 and double can use
> one bit pattern for NA (null) and not lose too much. A database system
> generally can't make that kind of compromise, and most popular
> databases can distinguish INT32_MIN (or any other value used as a
> sentinel) and null. If you loaded data from an Avro or Parquet file
> that contained one of those values, you'd have to decide what to do
> with the data (though I understand there are integer64 add-on packages
> for R now)
>
> Now back to Arrow -- we have 3 main kinds of data types:
>
> * Fixed size primitive
> * Variable size primitive (binary, utf8)
> * Nested (list, struct, union)
>
> Out of these, "fixed size primitive" is the only one that can
> generally support O(1) in-place mutation / updates, though all of them
> could support a O(1) "make null" operation (by zeroing a bit). In
> general, when faced with designs we have preferred choices benefiting
> use cases where datasets are treated as immutable or copy-on-write.
>
> If an application _does_ need to do mutation on primitive arrays, then
> you could choose to always allocate the validity bitmap so that it can
> be mutated without requiring allocations to happen arbitrarily in your
> processing workflow. But, if you have data without nulls, it is a nice
> feature to be able to ignore the bitmap or not allocate one at all. If
> you constructed an array from data that you know to be non-nullable,
> some implementations might wish to avoid the waste of creating a
> bitmap with all 1's.
>
> For example, if we create an array::Array from a normal NumPy array of
> integers (which cannot have nulls), we have
>
> In [6]: import pyarrow as pa
> In [7]: import numpy as np
> In [8]: arr = pa.array(np.array([1, 2, 3, 4]))
>
> In [9]: arr.buffers()
> Out[9]: [None, <pyarrow.lib.Buffer object at 0x...>]
>
> In [10]: arr.null_count
> Out[10]: 0
>
> Normally, the first buffer would be the validity bitmap memory, but
> here it was not allocated because there are no nulls.
>
> Creating an open standard data representation is a difficult thing;
> one cannot be "all things to all people" but the intent is to be a
> suitable lingua franca for language agnostic data interchange and as a
> runtime representation for analytical query engines (where most
> operators are "pure"). If the Arrow community's goal were to create a
> "mutable column store" then some things might be designed differently
> (perhaps more like internals of https://kudu.apache.org/). It is
> helpful to have an understanding of what compromises have been made
> and how costly they are in real world applications.
>
> best
> Wes
> On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau  wrote:
> >
> > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle  wrote:
> >
> > > 1. I see. Good idea. Can we assume bitmap is always present in Arrow
> then?
> > > I thought I'd seen Wes argue that if there were no NAs, the bitmap
> doesn't
> > > need to be allocated.  Indeed I wasn't worried about the extra storage,
> > > although for 10,000 columns I wonder about the number of vectors.
> > >
> >
> > I think different implementations handle this differently at the moment.
> In
> > the Java code, we allocate the validity buffer at initial allocation
> > always. We're also looking to enhance the allocation strategy so the
> fixed
> > part of values are always allocated with validity (single allocation) to
> > avoid any extra object housekeeping.
> >
> >
> > > 2. It's only subjective until the code complexity is measured, then
> it's
> > > not subjective. I suppose after 20 years of using sentinels, I'm used
> to it
> > > and trust it. I'll keep an open mind on this.
> > >
> > Yup, fair enough.
> >
> >
> > > 3. Since I criticized the scale of Wes' benchmark, I felt I should

[jira] [Created] (ARROW-3732) [R] Add functions to write RecordBatch or Schema to Message value, then read back

2018-11-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3732:
---

 Summary: [R] Add functions to write RecordBatch or Schema to 
Message value, then read back
 Key: ARROW-3732
 URL: https://issues.apache.org/jira/browse/ARROW-3732
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney


Follow up work to ARROW-3499





Re: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-08 Thread Wes McKinney
I opened an issue here
https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
welcome
On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney  wrote:
>
> hi Roman,
>
> We would welcome adding such a document to the Arrow wiki
> https://cwiki.apache.org/confluence/display/ARROW. As to your other
> questions, it really depends on whether there is a member of the
> Parquet community who will do the work. Patches that implement any
> released functionality in the Parquet format specification are
> welcome.
>
> Thanks
> Wes
> On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter
>  wrote:
> >
> > Hi everyone,
> > in parquet-format, there is now support for TIMESTAMP_NANOS: 
> > https://github.com/apache/parquet-format/pull/102
> > For parquet-cpp, this is not yet supported. I have a few questions now:
> > • is there an overview of what release of parquet-format is currently fully 
> > supported in parquet-cpp (something like a feature support matrix)?
> > • how fast are new features in parquet-format adopted?
> > I think having a document describing the current completeness of 
> > implementation of the spec would be very helpful for users of the 
> > parquet-cpp library.
> > Thanks,
> > Roman
> >
> >


[jira] [Created] (ARROW-3729) [C++] Support for writing TIMESTAMP_NANOS Parquet metadata

2018-11-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3729:
---

 Summary: [C++] Support for writing TIMESTAMP_NANOS Parquet metadata
 Key: ARROW-3729
 URL: https://issues.apache.org/jira/browse/ARROW-3729
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This was brought up on the mailing list.

We will also need to do corresponding work in the parquet-cpp library to opt in 
to writing nanosecond timestamps instead of casting to micro- or milliseconds.





[jira] [Created] (ARROW-3731) [R] R API for reading and writing Parquet files

2018-11-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3731:
---

 Summary: [R] R API for reading and writing Parquet files
 Key: ARROW-3731
 URL: https://issues.apache.org/jira/browse/ARROW-3731
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Wes McKinney


To start, this would be at the level of complexity of 
{{pyarrow.parquet.read_table}} and {{write_table}}





[jira] [Created] (ARROW-3730) [Python] Output a representation of pyarrow.Schema that can be used to reconstruct a schema in a script

2018-11-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3730:
---

 Summary: [Python] Output a representation of pyarrow.Schema that 
can be used to reconstruct a schema in a script
 Key: ARROW-3730
 URL: https://issues.apache.org/jira/browse/ARROW-3730
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


This would be like what {{__repr__}} is used for in many classes, or a schema 
as a list of tuples that can be passed to {{pyarrow.schema}}





Re: [ANNOUNCE] New Arrow PMC member: Krisztián Szűcs

2018-11-08 Thread Li Jin
Congrats!

On Thu, Nov 8, 2018 at 4:02 PM Uwe L. Korn  wrote:

> Congratulations Krisztián!
>
> On Thu, Nov 8, 2018, at 9:56 PM, Philipp Moritz wrote:
> > Congrats and welcome Krisztián!
> >
> > On Thu, Nov 8, 2018 at 11:48 AM Wes McKinney 
> wrote:
> >
> > > The Project Management Committee (PMC) for Apache Arrow has invited
> > > Krisztián Szűcs to become a PMC member and we are pleased to announce
> > > that he has accepted.
> > >
> > > Congratulations and welcome, Krisztián!
> > >
>


Re: [ANNOUNCE] New Arrow committers: Romain François, Sebastien Binet, Yosuke Shiro

2018-11-08 Thread Li Jin
Welcome!

On Thu, Nov 8, 2018 at 4:01 PM Uwe L. Korn  wrote:

> Welcome to all of you!
>
> On Thu, Nov 8, 2018, at 8:56 PM, Wes McKinney wrote:
> > On behalf of the Arrow PMC, I'm happy to announce that Romain
> > François, Sebastien Binet, and Yosuke Shiro have been invited to be
> > committers on the project.
> >
> > Welcome, and thanks for your contributions!
>


Re: Assign/update : NA bitmap vs sentinel

2018-11-08 Thread Wes McKinney
hey Matt,

Thanks for giving your perspective on the mailing list.

My objective in writing about this recently
(http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I
need to update since the sentinel case can be done more efficiently
than what's there now) was to help dispel the notion that using a
separate value (bit or byte) to encode nullness is a performance
compromise to comply with the requirements of database systems. I too
prefer real world benchmarks to microbenchmarks, and probably null
checking is not going to be the main driver of aggregate system
performance. I had heard many people over the years object to bitmaps
on performance grounds but without analysis to back it up.

Some context for other readers on the mailing list: A language like R
is not a database and has fewer built-in scalar types: int32, double,
string (interned), and boolean. Out of these, int32 and double can use
one bit pattern for NA (null) and not lose too much. A database system
generally can't make that kind of compromise, and most popular
databases can distinguish INT32_MIN (or any other value used as a
sentinel) and null. If you loaded data from an Avro or Parquet file
that contained one of those values, you'd have to decide what to do
with the data (though I understand there are integer64 add-on packages
for R now)

Now back to Arrow -- we have 3 main kinds of data types:

* Fixed size primitive
* Variable size primitive (binary, utf8)
* Nested (list, struct, union)

Out of these, "fixed size primitive" is the only one that can
generally support O(1) in-place mutation / updates, though all of them
could support a O(1) "make null" operation (by zeroing a bit). In
general, when faced with designs we have preferred choices benefiting
use cases where datasets are treated as immutable or copy-on-write.
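The O(1) "make null" operation mentioned above can be sketched in plain Python (illustrative only; Arrow implementations do this on their own buffer types):

```python
def make_null(validity: bytearray, i: int) -> None:
    """Mark element i as null by clearing its validity bit: an O(1) update."""
    validity[i // 8] &= 0xFF ^ (1 << (i % 8))

validity = bytearray([0b11111111])  # one byte covering 8 valid values
make_null(validity, 2)
print(bin(validity[0]))  # 0b11111011 -> element 2 is now null
```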

If an application _does_ need to do mutation on primitive arrays, then
you could choose to always allocate the validity bitmap so that it can
be mutated without requiring allocations to happen arbitrarily in your
processing workflow. But, if you have data without nulls, it is a nice
feature to be able to ignore the bitmap or not allocate one at all. If
you constructed an array from data that you know to be non-nullable,
some implementations might wish to avoid the waste of creating a
bitmap with all 1's.

For example, if we create an array::Array from a normal NumPy array of
integers (which cannot have nulls), we have

In [6]: import pyarrow as pa
In [7]: import numpy as np
In [8]: arr = pa.array(np.array([1, 2, 3, 4]))

In [9]: arr.buffers()
Out[9]: [None, <pyarrow.lib.Buffer object at 0x...>]

In [10]: arr.null_count
Out[10]: 0

Normally, the first buffer would be the validity bitmap memory, but
here it was not allocated because there are no nulls.

Creating an open standard data representation is a difficult thing;
one cannot be "all things to all people" but the intent is to be a
suitable lingua franca for language agnostic data interchange and as a
runtime representation for analytical query engines (where most
operators are "pure"). If the Arrow community's goal were to create a
"mutable column store" then some things might be designed differently
(perhaps more like internals of https://kudu.apache.org/). It is
helpful to have an understanding of what compromises have been made
and how costly they are in real world applications.

best
Wes
On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau  wrote:
>
> On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle  wrote:
>
> > 1. I see. Good idea. Can we assume bitmap is always present in Arrow then?
> > I thought I'd seen Wes argue that if there were no NAs, the bitmap doesn't
> > need to be allocated.  Indeed I wasn't worried about the extra storage,
> > although for 10,000 columns I wonder about the number of vectors.
> >
>
> I think different implementations handle this differently at the moment. In
> the Java code, we allocate the validity buffer at initial allocation
> always. We're also looking to enhance the allocation strategy so the fixed
> part of values are always allocated with validity (single allocation) to
> avoid any extra object housekeeping.
>
>
> > 2. It's only subjective until the code complexity is measured, then it's
> > not subjective. I suppose after 20 years of using sentinels, I'm used to it
> > and trust it. I'll keep an open mind on this.
> >
> Yup, fair enough.
>
>
> > 3. Since I criticized the scale of Wes' benchmark, I felt I should show how
> > I do benchmarks myself to show where I'm coming from. Yes none-null,
> > some-null and all-null paths offer savings. But that's the same under both
> > sentinel and bitmap approaches. Under both approaches, you just need to
> > know which case you're in. That involves storing the number of NAs in the
> > header/summary which can be done under both approaches.
> >
>
> The item we appreciate is that you can do a single comparison every 64
> values to determine which of the three cases you are in (make this a local
> decision). This means you don't 
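The "single comparison every 64 values" idea can be sketched as follows (names and structure are illustrative, not actual Arrow code):

```python
# One comparison on a packed 64-bit validity word classifies a whole
# 64-value chunk into none-null / all-null / some-null.
ALL_VALID = (1 << 64) - 1

def classify_word(word: int) -> str:
    if word == ALL_VALID:
        return "none-null"   # skip per-value null checks entirely
    if word == 0:
        return "all-null"    # skip the values entirely
    return "some-null"       # fall back to per-bit checks

print(classify_word(ALL_VALID))  # none-null
print(classify_word(0))          # all-null
print(classify_word(0b1010))     # some-null
```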

[jira] [Created] (ARROW-3717) Add GCSFSWrapper for DaskFileSystem

2018-11-08 Thread Emmett McQuinn (JIRA)
Emmett McQuinn created ARROW-3717:
-

 Summary: Add GCSFSWrapper for DaskFileSystem
 Key: ARROW-3717
 URL: https://issues.apache.org/jira/browse/ARROW-3717
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Emmett McQuinn


Currently there is an S3FSWrapper that extends the DaskFileSystem object to 
support functionality like isdir(...), isfile(...), and walk(...).

Adding a GCSFSWrapper would enable using Google Cloud Storage for packages 
depending on arrow.





[jira] [Created] (ARROW-3720) [GLib] Use "indices" instead of "indexes"

2018-11-08 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3720:
---

 Summary: [GLib] Use "indices" instead of "indexes"
 Key: ARROW-3720
 URL: https://issues.apache.org/jira/browse/ARROW-3720
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-3723) [Plasma] [Ruby] Add Ruby bindings of Plasma

2018-11-08 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-3723:
---

 Summary: [Plasma] [Ruby] Add Ruby bindings of Plasma
 Key: ARROW-3723
 URL: https://issues.apache.org/jira/browse/ARROW-3723
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Plasma (C++), Ruby
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.12.0








[jira] [Created] (ARROW-3725) [GLib] Add field readers to GArrowStructDataType

2018-11-08 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3725:
---

 Summary: [GLib] Add field readers to GArrowStructDataType
 Key: ARROW-3725
 URL: https://issues.apache.org/jira/browse/ARROW-3725
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








Re: [ANNOUNCE] New Arrow PMC member: Krisztián Szűcs

2018-11-08 Thread Uwe L. Korn
Congratulations Krisztián!

On Thu, Nov 8, 2018, at 9:56 PM, Philipp Moritz wrote:
> Congrats and welcome Krisztián!
> 
> On Thu, Nov 8, 2018 at 11:48 AM Wes McKinney  wrote:
> 
> > The Project Management Committee (PMC) for Apache Arrow has invited
> > Krisztián Szűcs to become a PMC member and we are pleased to announce
> > that he has accepted.
> >
> > Congratulations and welcome, Krisztián!
> >


Re: [ANNOUNCE] New Arrow committers: Romain François, Sebastien Binet, Yosuke Shiro

2018-11-08 Thread Uwe L. Korn
Welcome to all of you!

On Thu, Nov 8, 2018, at 8:56 PM, Wes McKinney wrote:
> On behalf of the Arrow PMC, I'm happy to announce that Romain
> François, Sebastien Binet, and Yosuke Shiro have been invited to be
> committers on the project.
> 
> Welcome, and thanks for your contributions!


Re: [ANNOUNCE] New Arrow committers: Romain François, Sebastien Binet, Yosuke Shiro

2018-11-08 Thread Antoine Pitrou


It's nice to have new people onboard.  Welcome everyone :-)

On 08/11/2018 at 20:56, Wes McKinney wrote:
> On behalf of the Arrow PMC, I'm happy to announce that Romain
> François, Sebastien Binet, and Yosuke Shiro have been invited to be
> committers on the project.
> 
> Welcome, and thanks for your contributions!
> 


Re: [ANNOUNCE] New Arrow PMC member: Krisztián Szűcs

2018-11-08 Thread Philipp Moritz
Congrats and welcome Krisztián!

On Thu, Nov 8, 2018 at 11:48 AM Wes McKinney  wrote:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Krisztián Szűcs to become a PMC member and we are pleased to announce
> that he has accepted.
>
> Congratulations and welcome, Krisztián!
>


[jira] [Created] (ARROW-3724) [GLib] Update gitignore

2018-11-08 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-3724:
---

 Summary: [GLib] Update gitignore
 Key: ARROW-3724
 URL: https://issues.apache.org/jira/browse/ARROW-3724
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.12.0








[ANNOUNCE] New Arrow PMC member: Krisztián Szűcs

2018-11-08 Thread Wes McKinney
The Project Management Committee (PMC) for Apache Arrow has invited
Krisztián Szűcs to become a PMC member and we are pleased to announce
that he has accepted.

Congratulations and welcome, Krisztián!


Re: Creating Buffer directly from pointer/length

2018-11-08 Thread Wes McKinney
I opened https://issues.apache.org/jira/browse/ARROW-3727 about adding
examples. I will mention to add an example for CUDA also
On Thu, Nov 8, 2018 at 2:30 PM Randy Zwitch  wrote:
>
> Thanks Uwe, Wes, Pearu and Antoine. This is in the pyarrow docs, but has no
> example, so I'll open a JIRA so that it might be more obvious to the
> next person.
>
>
> On 11/8/18 12:59 PM, Uwe L. Korn wrote:
> > Hello Randy,
> >
> > you are looking for 
> > https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer
> >  This takes an address, size and a Python object for having a reference on 
> > the object. In your case the last one can be None. Note that this will not 
> > do a copy and thus directly reference the shared memory.
> >
> > Uwe
> >
> > On Thu, Nov 8, 2018, at 6:49 PM, Randy Zwitch wrote:
> >> Within OmniSci (MapD), we have the following code that takes a pointer
> >> and length and reads to a NumPy array before calling py_buffer:
> >>
> >> https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52
> >>
> >> Is it possible to eliminate the NumPy step and go directly to an Arrow
> >> buffer? There is both a concern that we're doing an unnecessary memory
> >> copy, as well as wanting to defer to the Arrow way of doing things as
> >> much as possible rather than having our own shims like these.
> >>


Re: Creating Buffer directly from pointer/length

2018-11-08 Thread Randy Zwitch
Thanks Uwe, Wes, Pearu and Antoine. This is in the pyarrow docs, but has no 
example, so I'll open a JIRA so that it might be more obvious to the 
next person.



On 11/8/18 12:59 PM, Uwe L. Korn wrote:

Hello Randy,

you are looking for 
https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer
 This takes an address, size and a Python object for having a reference on the 
object. In your case the last one can be None. Note that this will not do a 
copy and thus directly reference the shared memory.

Uwe

On Thu, Nov 8, 2018, at 6:49 PM, Randy Zwitch wrote:

Within OmniSci (MapD), we have the following code that takes a pointer
and length and reads to a NumPy array before calling py_buffer:

https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52

Is it possible to eliminate the NumPy step and go directly to an Arrow 
buffer? There is both a concern that we're doing an unnecessary memory
copy, as well as wanting to defer to the Arrow way of doing things as
much as possible rather than having our own shims like these.



Re: Creating Buffer directly from pointer/length

2018-11-08 Thread Uwe L. Korn
Hello Randy,

you are looking for 
https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer
 This takes an address, size and a Python object for having a reference on the 
object. In your case the last one can be None. Note that this will not do a 
copy and thus directly reference the shared memory.

Uwe

On Thu, Nov 8, 2018, at 6:49 PM, Randy Zwitch wrote:
> Within OmniSci (MapD), we have the following code that takes a pointer 
> and length and reads to a NumPy array before calling py_buffer:
> 
> https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52
> 
> Is it possible to eliminate the NumPy step and go directly to an Arrow 
> buffer? There is both a concern that we're doing an unnecessary memory 
> copy, as well as wanting to defer to the Arrow way of doing things as 
> much as possible rather than having our own shims like these.
> 


[jira] [Created] (ARROW-3728) Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-08 Thread Micah Williamson (JIRA)
Micah Williamson created ARROW-3728:
---

 Summary: Merging Parquet Files - Pandas Meta in Schema Mismatch
 Key: ARROW-3728
 URL: https://issues.apache.org/jira/browse/ARROW-3728
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1, 0.11.0, 0.10.0
 Environment: Python 3.6.3
OSX 10.14
Reporter: Micah Williamson


From: 
https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
 
I am trying to merge multiple parquet files into one. Their schemas are 
identical field-wise but my {{ParquetWriter}} is complaining that they are not. 
After some investigation I found that the pandas meta in the schemas are 
different, causing this error.
 
Sample-

{code:python}
import pyarrow.parquet as pq

pq_tables = []
for file_ in files:
    pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
    pq_tables.append(pq_table)
    if writer is None:
        writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
                                  use_deprecated_int96_timestamps=True)
    writer.write_table(table=pq_table)
{code}

The error-

{code}
Traceback (most recent call last):
  File "{PATH_TO}/main.py", line 68, in lambda_handler
writer.write_table(table=pq_table)
  File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
 line 335, in write_table
raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
{code}






[jira] [Created] (ARROW-3726) [Rust] CSV Reader & Writer

2018-11-08 Thread nevi_me (JIRA)
nevi_me created ARROW-3726:
--

 Summary: [Rust] CSV Reader & Writer
 Key: ARROW-3726
 URL: https://issues.apache.org/jira/browse/ARROW-3726
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: nevi_me


As an Arrow Rust user, I would like to be able to read and write CSV files, so 
that I can quickly ingest data into an Arrow format for further use, and save 
outputs in CSV.

As there aren't yet many options for working with tabular/dataframe structures in 
Rust (other than Andy's DataFusion), I'm struggling to motivate this feature. 
However, I think building a CSV parser into Rust would reduce effort for future 
libs (incl. DataFusion).





[jira] [Created] (ARROW-3727) [Python] Document use of pyarrow.foreign_buffer in Sphinx

2018-11-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3727:
---

 Summary: [Python] Document use of pyarrow.foreign_buffer in Sphinx
 Key: ARROW-3727
 URL: https://issues.apache.org/jira/browse/ARROW-3727
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.12.0


This could be called out as a major section in 
http://arrow.apache.org/docs/python/memory.html for better discoverability





Re: Creating Buffer directly from pointer/length

2018-11-08 Thread Pearu Peterson
Hi,

For host memory, you can use pyarrow.foreign_buffer, see
  https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html

For device memory, one can use pyarrow.cuda.foreign_buffer.

HTH,
Pearu


On Thu, Nov 8, 2018 at 7:53 PM Randy Zwitch 
wrote:

> Within OmniSci (MapD), we have the following code that takes a pointer
> and length and reads to a NumPy array before calling py_buffer:
>
> https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52
>
> Is it possible to eliminate the NumPy step and go directly to an Arrow
> buffer? There is both a concern that we're doing an unnecessary memory
> copy, as well as wanting to defer to the Arrow way of doing things as
> much as possible rather than having our own shims like these.
>
>


Re: Creating Buffer directly from pointer/length

2018-11-08 Thread Wes McKinney
Yes, see pyarrow.foreign_buffer

If this isn't in the documentation, could you open a JIRA to fix that?

Thanks
Wes

On Thu, Nov 8, 2018, 11:53 AM Randy Zwitch wrote:
> Within OmniSci (MapD), we have the following code that takes a pointer
> and length and reads to a NumPy array before calling py_buffer:
>
> https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52
>
> Is it possible to eliminate the NumPy step and go directly to an Arrow
> buffer? There is both a concern that we're doing an unnecessary memory
> copy, as well as wanting to defer to the Arrow way of doing things as
> much as possible rather than having our own shims like these.
>
>


Re: Creating Buffer directly from pointer/length

2018-11-08 Thread Antoine Pitrou


You should be able to use pa.foreign_buffer():
https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer

Regards

Antoine.


On 08/11/2018 at 18:49, Randy Zwitch wrote:
> Within OmniSci (MapD), we have the following code that takes a pointer 
> and length and reads to a NumPy array before calling py_buffer:
> 
> https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52
> 
> Is it possible to eliminate the NumPy step and go directly to an Arrow 
> buffer? There is both a concern that we're doing an unnecessary memory 
> copy, as well as wanting to defer to the Arrow way of doing things as 
> much as possible rather than having our own shims like these.
> 


Creating Buffer directly from pointer/length

2018-11-08 Thread Randy Zwitch
Within OmniSci (MapD), we have the following code that takes a pointer 
and length and reads to a NumPy array before calling py_buffer:


https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52

Is it possible to eliminate the NumPy step and go directly to an Arrow 
buffer? There is both a concern that we're doing an unnecessary memory 
copy, as well as wanting to defer to the Arrow way of doing things as 
much as possible rather than having our own shims like these.




[jira] [Created] (ARROW-3718) [Gandiva] Remove spurious gtest include

2018-11-08 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3718:
-

 Summary: [Gandiva] Remove spurious gtest include
 Key: ARROW-3718
 URL: https://issues.apache.org/jira/browse/ARROW-3718
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Gandiva
Affects Versions: 0.11.1
Reporter: Philipp Moritz
 Fix For: 0.12.0


At the moment, cpp/src/gandiva/expr_decomposer.h includes a gtest header, which 
can prevent Gandiva from being built without the gtest dependency.





[jira] [Created] (ARROW-3722) [C++] Allow specifying column types to CSV reader

2018-11-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3722:
-

 Summary: [C++] Allow specifying column types to CSV reader
 Key: ARROW-3722
 URL: https://issues.apache.org/jira/browse/ARROW-3722
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.1
Reporter: Antoine Pitrou


I'm not sure how to expose this. The easiest, implementation-wise, would be to 
allow passing a {{Schema}} (for example inside the {{ConvertOptions}}).

Another possibility is to allow specifying the default types for type 
inference. For example type inference currently infers integers as {{int64}}, 
but the user might prefer {{int32}}.

Thoughts?





[jira] [Created] (ARROW-3721) [Gandiva] [Python] Support all Gandiva literals

2018-11-08 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3721:
-

 Summary: [Gandiva] [Python] Support all Gandiva literals
 Key: ARROW-3721
 URL: https://issues.apache.org/jira/browse/ARROW-3721
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


Support all the literals from 
[https://github.com/apache/arrow/blob/5b116ab175292fe70ed3c8727bcc6868b9695f4a/cpp/src/gandiva/tree_expr_builder.h#L35]
 in the Cython bindings.


