Re: Support for TIMESTAMP_NANOS in parquet-cpp
I would be willing to implement that. I’ll probably need some advice on my patch though, as I’m fairly new to the parquet code. Roman From: Wes McKinney Sent: Thursday, November 8, 2018 23:22 To: dev@arrow.apache.org Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp I opened an issue here https://issues.apache.org/jira/browse/ARROW-3729. Patches would be welcome On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote: > > hi Roman, > > We would welcome adding such a document to the Arrow wiki > https://cwiki.apache.org/confluence/display/ARROW. As to your other > questions, it really depends on whether there is a member of the > Parquet community who will do the work. Patches that implement any > released functionality in the Parquet format specification are > welcome. > > Thanks > Wes > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter > wrote: > > > > Hi everyone, > > in parquet-format, there is now support for TIMESTAMP_NANOS: > > https://github.com/apache/parquet-format/pull/102 > > For parquet-cpp, this is not yet supported. I have a few questions now: > > • is there an overview of what release of parquet-format is currently fully > > supported in parquet-cpp (something like a feature support matrix)? > > • how quickly are new features in parquet-format adopted? > > I think having a document describing the current completeness of > > implementation of the spec would be very helpful for users of the > > parquet-cpp library. > > Thanks, > > Roman > > > >
[jira] [Created] (ARROW-3733) [GLib] Add to_string() to GArrowTable and GArrowColumn
Kouhei Sutou created ARROW-3733: --- Summary: [GLib] Add to_string() to GArrowTable and GArrowColumn Key: ARROW-3733 URL: https://issues.apache.org/jira/browse/ARROW-3733 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Assign/update : NA bitmap vs sentinel
There is one database that I'm aware of that uses sentinels _and_ supports complex types with missing values: Kx's KDB+. This has led to some seriously strange choices like the ASCII space character being used as the sentinel value for strings. See https://code.kx.com/wiki/Reference/Datatypes for more details. On Thu, Nov 8, 2018 at 4:39 PM Wes McKinney wrote: > hey Matt, > > Thanks for giving your perspective on the mailing list. > > My objective in writing about this recently > (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I > need to update since the sentinel case can be done more efficiently > than what's there now) was to help dispel the notion that using a > separate value (bit or byte) to encode nullness is a performance > compromise to comply with the requirements of database systems. I too > prefer real world benchmarks to microbenchmarks, and probably null > checking is not going to be the main driver of aggregate system > performance. I had heard many people over the years object to bitmaps > on performance grounds but without analysis to back it up. > > Some context for other readers on the mailing list: A language like R > is not a database and has fewer built-in scalar types: int32, double, > string (interned), and boolean. Out of these, int32 and double can use > one bit pattern for NA (null) and not lose too much. A database system > generally can't make that kind of compromise, and most popular > databases can distinguish INT32_MIN (or any other value used as a > sentinel) and null. 
If you loaded data from an Avro or Parquet file > that contained one of those values, you'd have to decide what to do > with the data (though I understand there's integer64 add-on packages > for R now) > > Now back to Arrow -- we have 3 main kinds of data types: > > * Fixed size primitive > * Variable size primitive (binary, utf8) > * Nested (list, struct, union) > > Out of these, "fixed size primitive" is the only one that can > generally support O(1) in-place mutation / updates, though all of them > could support a O(1) "make null" operation (by zeroing a bit). In > general, when faced with designs we have preferred choices benefiting > use cases where datasets are treated as immutable or copy-on-write. > > If an application _does_ need to do mutation on primitive arrays, then > you could choose to always allocate the validity bitmap so that it can > be mutated without requiring allocations to happen arbitrarily in your > processing workflow. But, if you have data without nulls, it is a nice > feature to be able to ignore the bitmap or not allocate one at all. If > you constructed an array from data that you know to be non-nullable, > some implementations might wish to avoid the waste of creating a > bitmap with all 1's. > > For example, if we create an array::Array from a normal NumPy array of > integers (which cannot have nulls), we have > > In [6]: import pyarrow as pa > In [7]: import numpy as np > In [8]: arr = pa.array(np.array([1, 2, 3, 4])) > > In [9]: arr.buffers() > Out[9]: [None, ] > > In [10]: arr.null_count > Out[10]: 0 > > Normally, the first buffer would be the validity bitmap memory, but > here it was not allocated because there are no nulls. > > Creating an open standard data representation is a difficult thing; > one cannot be "all things to all people" but the intent is to be a > suitable lingua franca for language agnostic data interchange and as a > runtime representation for analytical query engines (where most > operators are "pure"). 
If the Arrow community's goal were to create a > "mutable column store" then some things might be designed differently > (perhaps more like internals of https://kudu.apache.org/). It is > helpful to have an understanding of what compromises have been made > and how costly they are in real world applications. > > best > Wes > On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau wrote: > > > > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle wrote: > > > > > 1. I see. Good idea. Can we assume bitmap is always present in Arrow > then? > > > I thought I'd seen Wes argue that if there were no NAs, the bitmap > doesn't > > > need to be allocated. Indeed I wasn't worried about the extra storage, > > > although for 10,000 columns I wonder about the number of vectors. > > > > > > > I think different implementations handle this differently at the moment. > In > > the Java code, we allocate the validity buffer at initial allocation > > always. We're also looking to enhance the allocation strategy so the > fixed > > part of values are always allocated with validity (single allocation) to > > avoid any extra object housekeeping. > > > > > > > 2. It's only subjective until the code complexity is measured, then > it's > > > not subjective. I suppose after 20 years of using sentinels, I'm used > to it > > > and trust it. I'll keep an open mind on this. > > > > > Yup, fair enough. > > > > > > > 3. Since I criticized the scale of Wes' benchmark, I felt I should
[jira] [Created] (ARROW-3732) [R] Add functions to write RecordBatch or Schema to Message value, then read back
Wes McKinney created ARROW-3732: --- Summary: [R] Add functions to write RecordBatch or Schema to Message value, then read back Key: ARROW-3732 URL: https://issues.apache.org/jira/browse/ARROW-3732 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Wes McKinney Follow up work to ARROW-3499
Re: Support for TIMESTAMP_NANOS in parquet-cpp
I opened an issue here https://issues.apache.org/jira/browse/ARROW-3729. Patches would be welcome On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote: > > hi Roman, > > We would welcome adding such a document to the Arrow wiki > https://cwiki.apache.org/confluence/display/ARROW. As to your other > questions, it really depends on whether there is a member of the > Parquet community who will do the work. Patches that implement any > released functionality in the Parquet format specification are > welcome. > > Thanks > Wes > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter > wrote: > > > > Hi everyone, > > in parquet-format, there is now support for TIMESTAMP_NANOS: > > https://github.com/apache/parquet-format/pull/102 > > For parquet-cpp, this is not yet supported. I have a few questions now: > > • is there an overview of what release of parquet-format is currently fully > > supported in parquet-cpp (something like a feature support matrix)? > > • how quickly are new features in parquet-format adopted? > > I think having a document describing the current completeness of > > implementation of the spec would be very helpful for users of the > > parquet-cpp library. > > Thanks, > > Roman > > > >
[jira] [Created] (ARROW-3729) [C++] Support for writing TIMESTAMP_NANOS Parquet metadata
Wes McKinney created ARROW-3729: --- Summary: [C++] Support for writing TIMESTAMP_NANOS Parquet metadata Key: ARROW-3729 URL: https://issues.apache.org/jira/browse/ARROW-3729 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney This was brought up on the mailing list. We also will need to do corresponding work in the parquet-cpp library to opt in to writing nanosecond timestamps instead of casting to micro- or millisecond.
[jira] [Created] (ARROW-3731) [R] R API for reading and writing Parquet files
Wes McKinney created ARROW-3731: --- Summary: [R] R API for reading and writing Parquet files Key: ARROW-3731 URL: https://issues.apache.org/jira/browse/ARROW-3731 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Wes McKinney To start, this would be at the level of complexity of {{pyarrow.parquet.read_table}} and {{write_table}}
[jira] [Created] (ARROW-3730) [Python] Output a representation of pyarrow.Schema that can be used to reconstruct a schema in a script
Wes McKinney created ARROW-3730: --- Summary: [Python] Output a representation of pyarrow.Schema that can be used to reconstruct a schema in a script Key: ARROW-3730 URL: https://issues.apache.org/jira/browse/ARROW-3730 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney This would be like what {{__repr__}} is used for in many classes, or a schema as a list of tuples that can be passed to {{pyarrow.schema}}
Re: [ANNOUNCE] New Arrow PMC member: Krisztián Szűcs
Congrats! On Thu, Nov 8, 2018 at 4:02 PM Uwe L. Korn wrote: > Congratulations Krisztián! > > On Thu, Nov 8, 2018, at 9:56 PM, Philipp Moritz wrote: > > Congrats and welcome Krisztián! > > > > On Thu, Nov 8, 2018 at 11:48 AM Wes McKinney > wrote: > > > > > The Project Management Committee (PMC) for Apache Arrow has invited > > > Krisztián Szűcs to become a PMC member and we are pleased to announce > > > that he has accepted. > > > > > > Congratulations and welcome, Krisztián! > > > >
Re: [ANNOUNCE] New Arrow committers: Romain François, Sebastien Binet, Yosuke Shiro
Welcome! On Thu, Nov 8, 2018 at 4:01 PM Uwe L. Korn wrote: > Welcome to all of you! > > On Thu, Nov 8, 2018, at 8:56 PM, Wes McKinney wrote: > > On behalf of the Arrow PMC, I'm happy to announce that Romain > > François, Sebastien Binet, and Yosuke Shiro have been invited to be > > committers on the project. > > > > Welcome, and thanks for your contributions! >
Re: Assign/update : NA bitmap vs sentinel
hey Matt, Thanks for giving your perspective on the mailing list. My objective in writing about this recently (http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/, though I need to update since the sentinel case can be done more efficiently than what's there now) was to help dispel the notion that using a separate value (bit or byte) to encode nullness is a performance compromise to comply with the requirements of database systems. I too prefer real world benchmarks to microbenchmarks, and probably null checking is not going to be the main driver of aggregate system performance. I had heard many people over the years object to bitmaps on performance grounds but without analysis to back it up. Some context for other readers on the mailing list: A language like R is not a database and has fewer built-in scalar types: int32, double, string (interned), and boolean. Out of these, int32 and double can use one bit pattern for NA (null) and not lose too much. A database system generally can't make that kind of compromise, and most popular databases can distinguish INT32_MIN (or any other value used as a sentinel) and null. If you loaded data from an Avro or Parquet file that contained one of those values, you'd have to decide what to do with the data (though I understand there's integer64 add-on packages for R now) Now back to Arrow -- we have 3 main kinds of data types: * Fixed size primitive * Variable size primitive (binary, utf8) * Nested (list, struct, union) Out of these, "fixed size primitive" is the only one that can generally support O(1) in-place mutation / updates, though all of them could support a O(1) "make null" operation (by zeroing a bit). In general, when faced with designs we have preferred choices benefiting use cases where datasets are treated as immutable or copy-on-write. 
If an application _does_ need to do mutation on primitive arrays, then you could choose to always allocate the validity bitmap so that it can be mutated without requiring allocations to happen arbitrarily in your processing workflow. But, if you have data without nulls, it is a nice feature to be able to ignore the bitmap or not allocate one at all. If you constructed an array from data that you know to be non-nullable, some implementations might wish to avoid the waste of creating a bitmap with all 1's. For example, if we create an array::Array from a normal NumPy array of integers (which cannot have nulls), we have In [6]: import pyarrow as pa In [7]: import numpy as np In [8]: arr = pa.array(np.array([1, 2, 3, 4])) In [9]: arr.buffers() Out[9]: [None, ] In [10]: arr.null_count Out[10]: 0 Normally, the first buffer would be the validity bitmap memory, but here it was not allocated because there are no nulls. Creating an open standard data representation is a difficult thing; one cannot be "all things to all people" but the intent is to be a suitable lingua franca for language agnostic data interchange and as a runtime representation for analytical query engines (where most operators are "pure"). If the Arrow community's goal were to create a "mutable column store" then some things might be designed differently (perhaps more like internals of https://kudu.apache.org/). It is helpful to have an understanding of what compromises have been made and how costly they are in real world applications. best Wes On Mon, Nov 5, 2018 at 8:27 PM Jacques Nadeau wrote: > > On Mon, Nov 5, 2018 at 3:43 PM Matt Dowle wrote: > > > 1. I see. Good idea. Can we assume bitmap is always present in Arrow then? > > I thought I'd seen Wes argue that if there were no NAs, the bitmap doesn't > > need to be allocated. Indeed I wasn't worried about the extra storage, > > although for 10,000 columns I wonder about the number of vectors. 
> > > > I think different implementations handle this differently at the moment. In > the Java code, we allocate the validity buffer at initial allocation > always. We're also looking to enhance the allocation strategy so the fixed > part of values are always allocated with validity (single allocation) to > avoid any extra object housekeeping. > > > > 2. It's only subjective until the code complexity is measured, then it's > > not subjective. I suppose after 20 years of using sentinels, I'm used to it > > and trust it. I'll keep an open mind on this. > > > Yup, fair enough. > > > > 3. Since I criticized the scale of Wes' benchmark, I felt I should show how > > I do benchmarks myself to show where I'm coming from. Yes none-null, > > some-null and all-null paths offer savings. But that's the same under both > > sentinel and bitmap approaches. Under both approaches, you just need to > > know which case you're in. That involves storing the number of NAs in the > > header/summary which can be done under both approaches. > > > > The item we appreciate is that you can do a single comparison every 64 > values to determine which of the three cases you are in (make this a local > decision). This means you don't
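[Editor's note] Jacques's "single comparison every 64 values" point can be sketched in plain Python. This is a hypothetical illustration, not Arrow's actual implementation: read the validity bitmap one 64-bit word at a time and branch into a none-null, all-null, or some-null path:

```python
import struct

ALL_VALID = 0xFFFFFFFFFFFFFFFF  # 64 set bits: every value in the word is valid


def classify_word(bitmap: bytes, word_index: int) -> str:
    """Classify 64 consecutive values with a single word comparison."""
    (word,) = struct.unpack_from("<Q", bitmap, word_index * 8)
    if word == ALL_VALID:
        return "none-null"   # fast path: skip per-value null checks entirely
    if word == 0:
        return "all-null"    # fast path: emit 64 nulls at once
    return "some-null"       # slow path: inspect bits individually


# Example: first word all valid, second word all null.
bitmap = b"\xff" * 8 + b"\x00" * 8
assert classify_word(bitmap, 0) == "none-null"
assert classify_word(bitmap, 1) == "all-null"
```

The decision is local to each 64-value chunk, which is the property the message above highlights.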
[jira] [Created] (ARROW-3717) Add GCSFSWrapper for DaskFileSystem
Emmett McQuinn created ARROW-3717: - Summary: Add GCSFSWrapper for DaskFileSystem Key: ARROW-3717 URL: https://issues.apache.org/jira/browse/ARROW-3717 Project: Apache Arrow Issue Type: New Feature Reporter: Emmett McQuinn Currently there is an S3FSWrapper that extends the DaskFileSystem object to support functionality like isdir(...), isfile(...), and walk(...). Adding a GCSFSWrapper would enable using Google Cloud Storage for packages depending on arrow.
[jira] [Created] (ARROW-3720) [GLib] Use "indices" instead of "indexes"
Kouhei Sutou created ARROW-3720: --- Summary: [GLib] Use "indices" instead of "indexes" Key: ARROW-3720 URL: https://issues.apache.org/jira/browse/ARROW-3720 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou
[jira] [Created] (ARROW-3723) [Plasma] [Ruby] Add Ruby bindings of Plasma
Yosuke Shiro created ARROW-3723: --- Summary: [Plasma] [Ruby] Add Ruby bindings of Plasma Key: ARROW-3723 URL: https://issues.apache.org/jira/browse/ARROW-3723 Project: Apache Arrow Issue Type: New Feature Components: Plasma (C++), Ruby Reporter: Yosuke Shiro Assignee: Yosuke Shiro Fix For: 0.12.0
[jira] [Created] (ARROW-3725) [GLib] Add field readers to GArrowStructDataType
Kouhei Sutou created ARROW-3725: --- Summary: [GLib] Add field readers to GArrowStructDataType Key: ARROW-3725 URL: https://issues.apache.org/jira/browse/ARROW-3725 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou
Re: [ANNOUNCE] New Arrow PMC member: Krisztián Szűcs
Congratulations Krisztián! On Thu, Nov 8, 2018, at 9:56 PM, Philipp Moritz wrote: > Congrats and welcome Krisztián! > > On Thu, Nov 8, 2018 at 11:48 AM Wes McKinney wrote: > > > The Project Management Committee (PMC) for Apache Arrow has invited > > Krisztián Szűcs to become a PMC member and we are pleased to announce > > that he has accepted. > > > > Congratulations and welcome, Krisztián! > >
Re: [ANNOUNCE] New Arrow committers: Romain François, Sebastien Binet, Yosuke Shiro
Welcome to all of you! On Thu, Nov 8, 2018, at 8:56 PM, Wes McKinney wrote: > On behalf of the Arrow PMC, I'm happy to announce that Romain > François, Sebastien Binet, and Yosuke Shiro have been invited to be > committers on the project. > > Welcome, and thanks for your contributions!
Re: [ANNOUNCE] New Arrow committers: Romain François, Sebastien Binet, Yosuke Shiro
It's nice to have new people onboard. Welcome everyone :-) On 08/11/2018 at 20:56, Wes McKinney wrote: > On behalf of the Arrow PMC, I'm happy to announce that Romain > François, Sebastien Binet, and Yosuke Shiro have been invited to be > committers on the project. > > Welcome, and thanks for your contributions! >
Re: [ANNOUNCE] New Arrow PMC member: Krisztián Szűcs
Congrats and welcome Krisztián! On Thu, Nov 8, 2018 at 11:48 AM Wes McKinney wrote: > The Project Management Committee (PMC) for Apache Arrow has invited > Krisztián Szűcs to become a PMC member and we are pleased to announce > that he has accepted. > > Congratulations and welcome, Krisztián! >
[jira] [Created] (ARROW-3724) [GLib] Update gitignore
Yosuke Shiro created ARROW-3724: --- Summary: [GLib] Update gitignore Key: ARROW-3724 URL: https://issues.apache.org/jira/browse/ARROW-3724 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Yosuke Shiro Assignee: Yosuke Shiro Fix For: 0.12.0
[ANNOUNCE] New Arrow PMC member: Krisztián Szűcs
The Project Management Committee (PMC) for Apache Arrow has invited Krisztián Szűcs to become a PMC member and we are pleased to announce that he has accepted. Congratulations and welcome, Krisztián!
Re: Creating Buffer directly from pointer/length
I opened https://issues.apache.org/jira/browse/ARROW-3727 about adding examples. I will also mention adding an example for CUDA. On Thu, Nov 8, 2018 at 2:30 PM Randy Zwitch wrote: > > Thanks Uwe, Wes, Pearu and Antoine. This is in the pyarrow docs, but no > example, so I'll open up a JIRA so that it might be more obvious to the > next person. > > > On 11/8/18 12:59 PM, Uwe L. Korn wrote: > > Hello Randy, > > > > you are looking for > > https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer > > This takes an address, size and a Python object for having a reference on > > the object. In your case the last one can be None. Note that this will not > > do a copy and thus directly reference the shared memory. > > > > Uwe > > > > On Thu, Nov 8, 2018, at 6:49 PM, Randy Zwitch wrote: > >> Within OmniSci (MapD), we have the following code that takes a pointer > >> and length and reads to a NumPy array before calling py_buffer: > >> > >> https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 > >> > >> Is it possible to eliminate the NumPy step and go directly to an Arrow > >> buffer? There is both a concern that we're doing an unnecessary memory > >> copy, as well as wanting to defer to the Arrow way of doing things as > >> much as possible rather than having our own shims like these. > >>
Re: Creating Buffer directly from pointer/length
Thanks Uwe, Wes, Pearu and Antoine. This is in the pyarrow docs, but no example, so I'll open up a JIRA so that it might be more obvious to the next person. On 11/8/18 12:59 PM, Uwe L. Korn wrote: Hello Randy, you are looking for https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer This takes an address, size and a Python object for having a reference on the object. In your case the last one can be None. Note that this will not do a copy and thus directly reference the shared memory. Uwe On Thu, Nov 8, 2018, at 6:49 PM, Randy Zwitch wrote: Within OmniSci (MapD), we have the following code that takes a pointer and length and reads to a NumPy array before calling py_buffer: https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 Is it possible to eliminate the NumPy step and go directly to an Arrow buffer? There is both a concern that we're doing an unnecessary memory copy, as well as wanting to defer to the Arrow way of doing things as much as possible rather than having our own shims like these.
Re: Creating Buffer directly from pointer/length
Hello Randy, you are looking for https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer This takes an address, size and a Python object for having a reference on the object. In your case the last one can be None. Note that this will not do a copy and thus directly reference the shared memory. Uwe On Thu, Nov 8, 2018, at 6:49 PM, Randy Zwitch wrote: > Within OmniSci (MapD), we have the following code that takes a pointer > and length and reads to a NumPy array before calling py_buffer: > > https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 > > Is it possible to eliminate the NumPy step and go directly to an Arrow > buffer? There is both a concern that we're doing an unnecessary memory > copy, as well as wanting to defer to the Arrow way of doing things as > much as possible rather than having our own shims like these. >
[jira] [Created] (ARROW-3728) Merging Parquet Files - Pandas Meta in Schema Mismatch
Micah Williamson created ARROW-3728: --- Summary: Merging Parquet Files - Pandas Meta in Schema Mismatch Key: ARROW-3728 URL: https://issues.apache.org/jira/browse/ARROW-3728 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.11.1, 0.11.0, 0.10.0 Environment: Python 3.6.3 OSX 10.14 Reporter: Micah Williamson From: https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch I am trying to merge multiple parquet files into one. Their schemas are identical field-wise but my {{ParquetWriter}} is complaining that they are not. After some investigation I found that the pandas meta in the schemas are different, causing this error. Sample-
{code:python}
import pyarrow.parquet as pq

pq_tables = []
for file_ in files:
    pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
    pq_tables.append(pq_table)
    if writer is None:
        writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
                                  use_deprecated_int96_timestamps=True)
    writer.write_table(table=pq_table)
{code}
The error-
{code}
Traceback (most recent call last):
  File "{PATH_TO}/main.py", line 68, in lambda_handler
    writer.write_table(table=pq_table)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", line 335, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
{code}
[jira] [Created] (ARROW-3726) [Rust] CSV Reader & Writer
nevi_me created ARROW-3726: -- Summary: [Rust] CSV Reader & Writer Key: ARROW-3726 URL: https://issues.apache.org/jira/browse/ARROW-3726 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: nevi_me As an Arrow Rust user, I would like to be able to read and write CSV files, so that I can quickly ingest data into an Arrow format for further use, and save outputs in CSV. As there aren't yet many options for working with tabular/df structures in Rust (other than Andy's DataFusion), I'm struggling to motivate this feature. However, I think building a CSV parser into Rust would reduce effort for future libs (incl. DataFusion).
[jira] [Created] (ARROW-3727) [Python] Document use of pyarrow.foreign_buffer in Sphinx
Wes McKinney created ARROW-3727: --- Summary: [Python] Document use of pyarrow.foreign_buffer in Sphinx Key: ARROW-3727 URL: https://issues.apache.org/jira/browse/ARROW-3727 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.12.0 This could be called out as a major section in http://arrow.apache.org/docs/python/memory.html for better discoverability
Re: Creating Buffer directly from pointer/length
Hi, For host memory, you can use pyarrow.foreign_buffer, see https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html For device memory, one can use pyarrow.cuda.foreign_buffer. HTH, Pearu On Thu, Nov 8, 2018 at 7:53 PM Randy Zwitch wrote: > Within OmniSci (MapD), we have the following code that takes a pointer > and length and reads to a NumPy array before calling py_buffer: > > https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 > > Is it possible to eliminate the NumPy step and go directly to an Arrow > buffer? There is both a concern that we're doing an unnecessary memory > copy, as well as wanting to defer to the Arrow way of doing things as > much as possible rather than having our own shims like these. > >
Re: Creating Buffer directly from pointer/length
Yes, see pyarrow.foreign_buffer. If this isn't in the documentation, could you open a JIRA to fix that? Thanks Wes On Thu, Nov 8, 2018, 11:53 AM Randy Zwitch wrote: > Within OmniSci (MapD), we have the following code that takes a pointer > and length and reads to a NumPy array before calling py_buffer: > > https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 > > Is it possible to eliminate the NumPy step and go directly to an Arrow > buffer? There is both a concern that we're doing an unnecessary memory > copy, as well as wanting to defer to the Arrow way of doing things as > much as possible rather than having our own shims like these. > >
Re: Creating Buffer directly from pointer/length
You should be able to use pa.foreign_buffer(): https://arrow.apache.org/docs/python/generated/pyarrow.foreign_buffer.html#pyarrow.foreign_buffer Regards Antoine. On 08/11/2018 at 18:49, Randy Zwitch wrote: > Within OmniSci (MapD), we have the following code that takes a pointer > and length and reads to a NumPy array before calling py_buffer: > > https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 > > Is it possible to eliminate the NumPy step and go directly to an Arrow > buffer? There is both a concern that we're doing an unnecessary memory > copy, as well as wanting to defer to the Arrow way of doing things as > much as possible rather than having our own shims like these. >
Creating Buffer directly from pointer/length
Within OmniSci (MapD), we have the following code that takes a pointer and length and reads to a NumPy array before calling py_buffer: https://github.com/omnisci/pymapd/blob/master/pymapd/shm.pyx#L31-L52 Is it possible to eliminate the NumPy step and go directly to an Arrow buffer? There is both a concern that we're doing an unnecessary memory copy, as well as wanting to defer to the Arrow way of doing things as much as possible rather than having our own shims like these.
[jira] [Created] (ARROW-3718) [Gandiva] Remove spurious gtest include
Philipp Moritz created ARROW-3718: - Summary: [Gandiva] Remove spurious gtest include Key: ARROW-3718 URL: https://issues.apache.org/jira/browse/ARROW-3718 Project: Apache Arrow Issue Type: Improvement Components: Gandiva Affects Versions: 0.11.1 Reporter: Philipp Moritz Fix For: 0.12.0 At the moment, cpp/src/gandiva/expr_decomposer.h includes a gtest header, which can prevent Gandiva from being built without the gtest dependency.
[jira] [Created] (ARROW-3722) [C++] Allow specifying column types to CSV reader
Antoine Pitrou created ARROW-3722: - Summary: [C++] Allow specifying column types to CSV reader Key: ARROW-3722 URL: https://issues.apache.org/jira/browse/ARROW-3722 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.11.1 Reporter: Antoine Pitrou I'm not sure how to expose this. The easiest, implementation-wise, would be to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}). Another possibility is to allow specifying the default types for type inference. For example type inference currently infers integers as {{int64}}, but the user might prefer {{int32}}. Thoughts?
[jira] [Created] (ARROW-3721) [Gandiva] [Python] Support all Gandiva literals
Philipp Moritz created ARROW-3721: - Summary: [Gandiva] [Python] Support all Gandiva literals Key: ARROW-3721 URL: https://issues.apache.org/jira/browse/ARROW-3721 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz Support all the literals from [https://github.com/apache/arrow/blob/5b116ab175292fe70ed3c8727bcc6868b9695f4a/cpp/src/gandiva/tree_expr_builder.h#L35] in the Cython bindings.