Re: Efficiently allocating an empty vector (python)

2019-12-11 Thread Ted Gooch
Not sure if this is any better, but I have an open PR right now in Iceberg,
where we are doing something similar:
https://github.com/apache/incubator-iceberg/pull/544/commits/28166fd3f0e3a24863048a2721f1ae69f243e2af#diff-51d6edf951c105e1e62a3f1e8b4640aaR319-R341

@staticmethod
def create_null_column(reference_column, name, dtype_tuple):
    dtype, init_val = dtype_tuple
    chunk = pa.chunked_array(
        [pa.array(np.full(len(c), init_val), type=dtype, mask=[True] * len(c))
         for c in reference_column.data.chunks],
        type=dtype)

    return pa.Column.from_array(name, chunk)


Note that this is using the <0.15 Column API, which has been deprecated.
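
For newer pyarrow, where the Column API is going away, a rough sketch of the
same idea that skips both the Column API and the big Python list of Nones
(this assumes your pyarrow version ships pa.nulls):

import pyarrow as pa

def null_chunked_array(reference_column, dtype):
    # Build an all-null chunked array whose chunk layout mirrors the
    # reference ChunkedArray, without materializing a list of Nones.
    return pa.chunked_array(
        [pa.nulls(len(chunk), type=dtype) for chunk in reference_column.chunks],
        type=dtype)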

On Wed, Dec 11, 2019 at 10:36 AM Weston Pace  wrote:

> I'm trying to combine multiple parquet files.  They were produced at
> different points in time and have different columns.  For example, one has
> columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.  I
> want to concatenate all three into one table with columns A, B, C, D, E.
>
> To do this I am adding the missing columns to each table.  For example, I
> am adding column D to table one and setting all values to null.  In order
> to do this I need to create a vector with length equal to one.num_rows and
> set all values to null.  The vector type is controlled by the type of D in
> the other tables.
>
> I am currently doing this by creating one large python list ahead of time
> and using:
>
> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0,
> desired_size)
>
> However, this ends up being very slow.  The calls to pa.array take longer
> than reading the data in the first place.
>
> I can build a large empty vector for every possible data type at the start
> of my application but that seems inefficient.
>
> Is there a good way to initialize a vector with all null values that I am
> missing?
>


[jira] [Created] (ARROW-7080) [Python][Parquet] Expose parquet field_id in Schema objects

2019-11-06 Thread Ted Gooch (Jira)
Ted Gooch created ARROW-7080:


 Summary: [Python][Parquet] Expose parquet field_id in Schema objects
 Key: ARROW-7080
 URL: https://issues.apache.org/jira/browse/ARROW-7080
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Ted Gooch


I'm in the process of adding parquet read support to
Iceberg (https://iceberg.apache.org/), and we use the parquet field_ids as a
consistent id when reading a parquet file, to create a map between the current
schema and the schema of the file being read.  Unless I've missed something, it
appears that field_id is not exposed in the Python APIs: it is available neither
in pyarrow._parquet.ParquetSchema nor in pyarrow.lib.Schema.

Would it be possible to add this to either of those two objects?
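
To make the ask concrete, here is a hypothetical access pattern.  The file name
is a placeholder, and the b"PARQUET:field_id" metadata key is only an
illustration of where the id could surface, not an existing pyarrow API:

import pyarrow.parquet as pq

schema = pq.read_schema("part-00000.parquet")  # placeholder file
for field in schema:
    metadata = field.metadata or {}
    # Hypothetical: assumes the parquet field_id would be exposed via
    # field-level metadata under some agreed key.
    field_id = metadata.get(b"PARQUET:field_id")
    print(field.name, field_id)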





Re: questions about Gandiva

2019-10-31 Thread Ted Gooch
You can also see the Gandiva Python bindings exercised in the pyarrow tests:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py
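
A minimal sketch along the lines of those tests (this assumes a pyarrow build
with Gandiva enabled; exact APIs may differ slightly between versions):

import pyarrow as pa
import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays(
    [pa.array([1.0, 2.0, 3.0]), pa.array([4.0, 5.0, 6.0])], ["a", "b"])

builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema[0])
node_b = builder.make_field(table.schema[1])
sum_node = builder.make_function("add", [node_a, node_b], pa.float64())
expr = builder.make_expression(sum_node, pa.field("a_plus_b", pa.float64()))

# JIT-compile a projector for the expression, then evaluate per record batch.
projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool())
for batch in table.to_batches():
    (result,) = projector.evaluate(batch)
    print(result)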


On Thu, Oct 31, 2019 at 10:26 AM Wes McKinney  wrote:

> hi
>
> On Thu, Oct 31, 2019 at 12:11 AM Yibo Cai  wrote:
> >
> > Hi,
> >
> > Arrow cpp integrates Gandiva to provide low level operations on arrow
> buffers. [1][2]
> > I have some questions, any help is appreciated:
> > - Arrow cpp already has a compute kernel[3]; does it duplicate what
> > Gandiva provides? I saw a Jira discussing it. [4]
>
> No. There are some cases of functional overlap but we are servicing a
> spectrum of use cases beyond the scope of Gandiva. Additionally, it is
> unclear to me that an LLVM JIT compilation step should be required to
> evaluate simple expressions such as "a > 5" -- in addition to
> introducing latency (due to the compilation step) it is also a heavy
> dependency to require the LLVM runtime in all applications.
>
> Personally I'm interested in supporting a wide gamut of analytics
> workloads, from data frame / data science type libraries to SQL-like
> systems. Gandiva is designed for the needs of a SQL-based execution
> engine where chunks of data are fed into Projection or Filter nodes in
> a computation graph -- Gandiva generates a specialized kernel to
> perform a unit of work inside those nodes. Realistically, I expect
> many real world applications will contain a mixture of pre-compiled
> analytic kernels and JIT-compiled kernels.
>
> Rome wasn't built in a day, so I'm expecting several years of work
> ahead of us at the present rate. We need more help in this domain.
>
> > - Is Gandiva only for arrow cpp? What about other languages(go, rust,
> ...)?
>
> It's being used in Java via JNI. The same approach could be applied
> for the other languages as they have their own C FFI mechanisms.
>
> > - Gandiva leverages SIMD for vectorized operations[1], but I didn't see
> any related code. Am I missing something?
>
> My understanding is that LLVM inserts many SIMD instructions
> automatically based on the host CPU architecture version. Gandiva
> developers may have some comments / pointers about this.
>
> >
> > [1]
> https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
> > [2] https://github.com/apache/arrow/tree/master/cpp/src/gandiva
> > [3] https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> > [4] https://issues.apache.org/jira/browse/ARROW-7017
> >
> > Thanks,
> > Yibo
>


Re: A couple of questions about pyarrow.parquet

2019-05-17 Thread Ted Gooch
Thanks Micah and Wes.

Definitely interested in the *Predicate Pushdown* and *Schema inference,
schema-on-read, and schema normalization* sections.

On Fri, May 17, 2019 at 12:47 PM Wes McKinney  wrote:

> Please see also
>
>
> https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
>
> And prior mailing list discussion. I will comment in more detail on the
> other items later
>
> On Fri, May 17, 2019, 2:44 PM Micah Kornfield 
> wrote:
>
> > I can't help on the first question.
> >
> > Regarding push-down predicates, there is an open JIRA [1] to do just that
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-473
> > <
> >
> https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22
> > >
> >
> > On Fri, May 17, 2019 at 11:48 AM Ted Gooch  wrote:
> >
> > > Hi,
> > >
> > > I've been doing some work trying to get the parquet read path going for
> > the
> > > python iceberg <https://github.com/apache/incubator-iceberg>
> library.  I
> > > have two questions that I couldn't get figured out, and was hoping I
> > could
> > > get some guidance from the list here.
> > >
> > > First, I'd like to create a ParquetSchema->IcebergSchema converter, but
> > it
> > > appears that only limited information is available in the ColumnSchema
> > > passed back to the python client[2]:
> > >
> > > 
> > >   name: key
> > >   path: m.map.key
> > >   max_definition_level: 2
> > >   max_repetition_level: 1
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > > 
> > >   name: key
> > >   path: m.map.value.map.key
> > >   max_definition_level: 4
> > >   max_repetition_level: 2
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > > 
> > >   name: value
> > >   path: m.map.value.map.value
> > >   max_definition_level: 5
> > >   max_repetition_level: 2
> > >   physical_type: BYTE_ARRAY
> > >   logical_type: UTF8
> > >
> > >
> > > where physical_type and logical_type are both strings[1].  The arrow
> > > schema I can get from *to_arrow_schema* looks to be more expressive
> > > (although maybe I just don't understand the parquet format well enough):
> > >
> > > m: struct<map: list<struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>> not null>>
> > >   child 0, map: list<struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>> not null>
> > >   child 0, map: struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>>
> > >   child 0, key: string
> > >   child 1, value: struct<map: list<struct<key: string, value: string> not null>>
> > >   child 0, map: list<struct<key: string, value: string> not null>
> > >   child 0, map: struct<key: string, value: string>
> > >   child 0, key: string
> > >   child 1, value: string
> > >
> > >
> > > It seems like I can infer the info from the name/path, but is there a more
> > > direct way of getting the detailed parquet schema information?
> > >
> > > Second question: is there a way to push record-level filtering into the
> > > parquet reader, so that the parquet reader only reads in values that match
> > > a given predicate expression? Predicate expressions would be simple
> > > field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not null)
> > > connected with logical operators (AND, OR, NOT).
> > >
> > > I've seen that after reading in I can use the filtering language in
> > > gandiva[3] to get filtered record batches, but was looking for somewhere
> > > lower in the stack if possible.
> > >
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > > [2] Spark/Hive Table DDL for this parquet file looks like:
> > > CREATE TABLE `iceberg`.`nested_map` (
> > > m map<string, map<string, string>>)
> > > [3]
> > >
> > >
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
> > >
> >
>


A couple of questions about pyarrow.parquet

2019-05-17 Thread Ted Gooch
Hi,

I've been doing some work trying to get the parquet read path going for the
python iceberg <https://github.com/apache/incubator-iceberg> library.  I
have two questions that I couldn't get figured out, and was hoping I could
get some guidance from the list here.

First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
appears that only limited information is available in the ColumnSchema
passed back to the python client[2]:


  name: key
  path: m.map.key
  max_definition_level: 2
  max_repetition_level: 1
  physical_type: BYTE_ARRAY
  logical_type: UTF8

  name: key
  path: m.map.value.map.key
  max_definition_level: 4
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8

  name: value
  path: m.map.value.map.value
  max_definition_level: 5
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8


where physical_type and logical_type are both strings[1].  The arrow schema
I can get from *to_arrow_schema* looks to be more expressive (although maybe
I just don't understand the parquet format well enough):

m: struct<map: list<struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>> not null>>
  child 0, map: list<struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>> not null>
  child 0, map: struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>>
  child 0, key: string
  child 1, value: struct<map: list<struct<key: string, value: string> not null>>
  child 0, map: list<struct<key: string, value: string> not null>
  child 0, map: struct<key: string, value: string>
  child 0, key: string
  child 1, value: string


It seems like I can infer the info from the name/path, but is there a more
direct way of getting the detailed parquet schema information?
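
For concreteness, this is roughly how I'm pulling the column-level information
today (an untested sketch; the file name is just a placeholder):

import pyarrow.parquet as pq

pf = pq.ParquetFile("nested_map.parquet")      # placeholder path
parquet_schema = pf.schema                     # pyarrow._parquet.ParquetSchema
arrow_schema = parquet_schema.to_arrow_schema()  # the richer view shown above

for i in range(pf.metadata.num_columns):
    col = parquet_schema.column(i)             # pyarrow._parquet.ColumnSchema
    # Only flat, per-leaf info is exposed here: a dotted path plus type
    # strings, so the nesting has to be inferred from the path.
    print(col.path, col.physical_type, col.logical_type,
          col.max_definition_level, col.max_repetition_level)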

Second question: is there a way to push record-level filtering into the
parquet reader, so that the parquet reader only reads in values that match
a given predicate expression? Predicate expressions would be simple
field-to-literal comparisons (>, >=, ==, <=, <, !=, is null, is not null)
connected with logical operators (AND, OR, NOT).

I've seen that after reading in I can use the filtering language in
gandiva[3] to get filtered record batches, but was looking for somewhere
lower in the stack if possible.
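
For reference, a rough sketch of that gandiva-based post-read filtering,
following the linked test[3] (this assumes a pyarrow build with Gandiva
enabled):

import pyarrow as pa
import pyarrow.gandiva as gandiva

table = pa.Table.from_arrays([pa.array([1.0, 31.0, 46.0, 3.0, 57.0])], ["a"])

builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema[0])        # column "a"
literal = builder.make_literal(40.0, pa.float64())  # comparison literal
condition = builder.make_condition(
    builder.make_function("greater_than", [node_a, literal], pa.bool_()))

gandiva_filter = gandiva.make_filter(table.schema, condition)
# Evaluating returns a selection vector of row indices where a > 40.0; the
# filtering happens after the data has already been read into Arrow memory.
selection = gandiva_filter.evaluate(table.to_batches()[0], pa.default_memory_pool())
print(selection.to_array())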



[1]
https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
[2] Spark/Hive Table DDL for this parquet file looks like:
CREATE TABLE `iceberg`.`nested_map` (
m map<string, map<string, string>>)
[3]
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100