Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Micah Kornfield
>
> What does the parallel list mean?

Something like:

table RecordBatch {
  nodes: [FieldNode];
  // Statistics related to the data represented by each FieldNode.
  // This field is either length=0 or has the same length as nodes.
  statistics: [Statistic];
}
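
For illustration (a pyarrow sketch; the dict layout is an assumption, nothing
here is part of the format yet), the per-field statistics such a parallel list
could carry might be computed like this:

import pyarrow as pa
import pyarrow.compute as pc

batch = pa.record_batch([pa.array([1, 5, 9]), pa.array([2.0, 0.5, 7.25])],
                        names=["ts", "val"])

# One statistics entry per top-level field, in the same order as
# RecordBatch.nodes.
statistics = []
for column in batch.columns:
    mm = pc.min_max(column).as_py()  # {'min': ..., 'max': ...}
    statistics.append({"min": mm["min"], "max": mm["max"]})

print(statistics)  # [{'min': 1, 'max': 9}, {'min': 0.5, 'max': 7.25}]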

On Wed, Feb 17, 2021 at 8:34 PM Kohei KaiGai  wrote:

> Thanks for the clarification.
>
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> For example, JSON array of min/max values as a key-value metadata
> in the Footer->Schema->Fields[]->custom_metadata?
> Even though the metadata field must be less than INT_MAX, I think it
> is a portable enough and non-invasive way.
>
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> What does the parallel list mean?
> If we had a standardized binary structure, like DictionaryBatch,
> to store the statistics including min/max values, that would make more
> sense than text-encoded key-value metadata, of course.
>
> Best regards,
>
> On Thu, Feb 18, 2021 at 12:37 Micah Kornfield  wrote:
> >
> > There is key-value metadata available on Message which might be able to
> > work in the short term (some sort of encoded message).  I think
> > standardizing how we store statistics per batch does make sense.
> >
> > We unfortunately can't add anything to field-node without breaking
> > compatibility.  But  another option would be to add a new structure as a
> > parallel list on RecordBatch itself.
> >
> > If we do add a new structure or arbitrary key-value pair we should not use
> > KeyValue but should have something where the values can be bytes.
> >
> > On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai  wrote:
> >
> > > Hello,
> > >
> > > Does Apache Arrow have any standard way to embed min/max values of the
> > > fields
> > > on a per-record-batch basis?
> > > It looks like FieldNode supports neither a dedicated min/max attribute nor
> > > custom-metadata.
> > > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> > >
> > > If we embed an array of min/max values into the custom-metadata of the
> > > Field-node,
> > > we may be able to implement it.
> > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> > >
> > > What I would like to implement is something like the BRIN index in PostgreSQL.
> > > http://heterodb.github.io/pg-strom/brin/
> > >
> > > This index contains only min/max values for particular block ranges, and
> > > the query executor can skip blocks that obviously don't contain the target
> > > data. If we can skip 9,990 of 10,000 record batches by checking metadata on
> > > a query that tries to fetch items in a very narrow timestamp range, it is a
> > > great acceleration compared to full file scans.
> > >
> > > Best regards,
> > > --
> > > HeteroDB, Inc / The PG-Strom Project
> > > KaiGai Kohei 
> > >
>
>
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei 
>


Re: [Python] A user friendly way to filter parquet partitions

2021-02-17 Thread Micah Kornfield
Hi Weiyang,
The library looks interesting, and for Python it certainly seems like it
might provide a better user experience.

I'm not super active in python maintenance (others who are can hopefully
chime in).  But my impression is we try to keep dependencies minimal in
general.

Furthermore, the goal of the C++ library and associated bindings is to push
as much work down into C++ (ultimately filtering capabilities equivalent to
Pandas will be built)  so that all languages  can take advantage of the
same core code.

-Micah


On Sun, Feb 14, 2021 at 10:09 PM Bill Zhao  wrote:

> Hi Dev team,
>
> I created a PyPI package to allow user-friendly expression of conditions.
> For example, a condition can be written as:
>
> (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2']
>
> where A, B, C are partition keys, and f.C == ['c1', 'c2']  means f.C in
> ['c1',
> 'c2'].
>
> Arbitrary condition objects can be converted to pyarrow's filters by
> calling its
>
> to_pyarrow_filter() method, which will normalize the condition to conform
> to pyarrow filter specification. The filter can also be converted back to a
> condition object.
>
> We can therefore take a condition object as the filter parameter directly
> in the read_table() and ParquetDataset() APIs as a user-friendly way to
> create the conditions.
>
> Furthermore, the condition object can be directly used to filter partition
> paths. This could replace the current complex filtering code (both native
> and Python).
>
> For max efficiency, filtering with the condition object can be done in the
> ways below:
>
> 1. read the paths in chunks to keep the memory footprint small;
> 2. parse the paths into a pandas dataframe;
> 3. use condition.query(dataframe) to get the filtered dataframe of paths;
> 4. use the numexpr backend for the dataframe query for efficiency;
> 5. concatenate the filtered dataframes of all chunks.
>
> For usage details of the package, please see its documentation at:
>
> https://condition.readthedocs.io/en/latest/usage.html
> 
>
>
> https://condition.readthedocs.io/en/latest/usage.html#pyarrow-partition-filtering
>
> What do you think? Your discussion and suggestions are appreciated.
>
>  A JIRA ticket is already created:
>
> https://issues.apache.org/jira/projects/ARROW/issues/ARROW-11566
>
> Thank you,
>
> Weiyang (Bill)
>
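
For comparison, pyarrow's existing filters argument already accepts conditions
in disjunctive normal form as (column, op, value) tuples; the example condition
above would normalize to something like the sketch below (the dataset path is
a placeholder):

import pyarrow.parquet as pq

# (f.A <= 3 or f.B != 'b1') and f.C == ['c1', 'c2'], distributed into DNF:
filters = [
    [("A", "<=", 3), ("C", "in", ["c1", "c2"])],
    [("B", "!=", "b1"), ("C", "in", ["c1", "c2"])],
]
table = pq.read_table("path/to/dataset", filters=filters)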


Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Micah Kornfield
>
> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.


I seem to recall the debate was how to model some class interactions to
determine what should be considered shared structures and what should not.
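
As a generic illustration of "locks around shared structures" (plain Python,
not the parquet-cpp API; every name here is hypothetical), the shared key
cache idea amounts to:

import threading

class KeyCache:
    """A KMS key cache shared by any number of reader threads."""
    def __init__(self, fetch_from_kms):
        self._fetch = fetch_from_kms
        self._lock = threading.Lock()
        self._keys = {}

    def get(self, key_id):
        with self._lock:
            if key_id not in self._keys:
                # Only the first thread asking for a key pays the KMS call;
                # later lookups are served from the cache.
                self._keys[key_id] = self._fetch(key_id)
            return self._keys[key_id]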

On Wed, Feb 17, 2021 at 9:52 AM Gidon Gershinsky  wrote:

> This certainly sounds good to me.
>
> Cheers, Gidon
>
>
> On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou  wrote:
>
> >
> > I don't think any notion of threading should be present in the
> > implementation, except for the required locks around shared structures.
> >  I don't know where the idea of a "main thread" comes from, but it
> > probably shouldn't exist in a C++ library.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On 17/02/2021 at 18:34, Gidon Gershinsky wrote:
> > > Just to clarify. There are two options, which one do you refer to? A design
> > > with a main thread that handles projections and the keys (relevant for the
> > > projected columns); or the current code with any thread allowed to handle
> > > full file reading, including the footer, column projections and their keys?
> > > Can you finalize this with Micah?
> > > The good news is, Tham is still interested to resume this work, and is ok
> > > with either option. Please let her know whether the current threading model
> > > stays, or should be modified with the changes proposed in the doc (for the
> > > latter, some guidance with the details would be needed).
> > >
> > > Cheers, Gidon
> > >
> > >
> > > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou 
> > wrote:
> > >
> > >>
> > >>
> > >> On 17/02/2021 at 12:47, Gidon Gershinsky wrote:
> > >>> From the doc,
> > >>> "To maintain consistency with the style of parquet-cpp, the above
> > >>> structures should not be explicitly synchronized with individual mutexes.
> > >>> In the case of a parquet::arrow::FileReader, the request to read a given
> > >>> selection of row groups and columns is issued from a single main thread.
> > >>> Note that this does require that all keys required for a read are assembled
> > >>> on the main thread so that DecryptionKeyRetriever objects are not directly
> > >>> accessing any caches"
> > >>>
> > >>> The current PR code doesn't require a single main thread. Any thread can
> > >>> read any file, both footer and pages. So the key cache is shared, to save
> > >>> N-1 interactions with the KMS server.
> > >>
> > >> I don't think there's any contention on this.  IMHO the only concerns
> > >> are about the implementation, not the semantics.
> > >>
> > >> Best regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >>>
> > >>> Cheers, Gidon
> > >>>
> > >>>
> > >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou 
> > >> wrote:
> > >>>
> > 
> >  I'm not sure a threading model is expected for an encryption layer.
> > Am
> >  I missing something?
> > 
> >  Regards
> > 
> >  Antoine.
> > 
> > 
> > On 17/02/2021 at 06:59, Gidon Gershinsky wrote:
> > > Precisely, the main change is in the threading model. Afaik, the document
> > > proposes a model that fits pandas, but might be problematic for other users
> > > of this library.
> > > Technically, this is not a showstopper though; if the community decides on
> > > this model, it will be compatible with the high-level encryption design;
> > > but the change implementation would need to be done by pandas experts (not
> > > us; but we'll help where we can).
> > > Micah, you know this subject (and the community) better than we do - we'd
> > > much appreciate it if you'd take a lead on removing this roadblock.
> > >
> > > Cheers, Gidon
> > >
> > >
> > > On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> I think some of the comments might be conflicting.  One of the concerns
> > >> (that I would need to refresh myself on to offer an opinion which was
> > >> covered in Ben's doc) was the threading model we expect in the library.
> > >>
> > >> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou <anto...@python.org> wrote:
> > >>
> > >>>
> > >>> Hi Gidon,
> > >>>
> > >>> On 16/02/2021 at 16:42, Gidon Gershinsky wrote:
> > >>>> Regarding the high-level layer, I think it waits for progress at
> > >>>> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> > >>>> No activity there since last November. This is unfortunate, because Tham
> > >>>> has put a lot of work in coding the high-level layer (and addressing 200+
> > >>>> review comments) in the PR https://github.com/apache/arrow/pull/8023. The

Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Kohei KaiGai
Thanks for the clarification.

> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message).  I think
> standardizing how we store statistics per batch does make sense.
>
For example, JSON array of min/max values as a key-value metadata
in the Footer->Schema->Fields[]->custom_metadata?
Even though the metadata field must be less than INT_MAX, I think it
is a portable enough and non-invasive way.
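
A minimal pyarrow sketch of that idea (the "min_max" key name and the JSON
layout are assumptions, not a standard):

import json
import pyarrow as pa

# One JSON array entry per record batch in the file (hypothetical layout).
stats = json.dumps([{"min": 1, "max": 9}, {"min": 12, "max": 20}])
field = pa.field("ts", pa.int64(), metadata={"min_max": stats})
schema = pa.schema([field])
print(schema.field("ts").metadata[b"min_max"])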

> We unfortunately can't add anything to field-node without breaking
> compatibility.  But  another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
What does the parallel list mean?
If we had a standardized binary structure, like DictionaryBatch,
to store the statistics including min/max values, that would make more
sense than text-encoded key-value metadata, of course.

Best regards,

On Thu, Feb 18, 2021 at 12:37 Micah Kornfield  wrote:
>
> There is key-value metadata available on Message which might be able to
> work in the short term (some sort of encoded message).  I think
> standardizing how we store statistics per batch does make sense.
>
> We unfortunately can't add anything to field-node without breaking
> compatibility.  But  another option would be to add a new structure as a
> parallel list on RecordBatch itself.
>
> If we do add a new structure or arbitrary key-value pair we should not use
> KeyValue but should have something where the values can be bytes.
>
> On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai  wrote:
>
> > Hello,
> >
> > Does Apache Arrow have any standard way to embed min/max values of the
> > fields
> > on a per-record-batch basis?
> > It looks like FieldNode supports neither a dedicated min/max attribute nor
> > custom-metadata.
> > https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
> >
> > If we embed an array of min/max values into the custom-metadata of the
> > Field-node,
> > we may be able to implement it.
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
> >
> > What I would like to implement is something like the BRIN index in PostgreSQL.
> > http://heterodb.github.io/pg-strom/brin/
> >
> > This index contains only min/max values for particular block ranges, and
> > the query executor can skip blocks that obviously don't contain the target
> > data. If we can skip 9,990 of 10,000 record batches by checking metadata on
> > a query that tries to fetch items in a very narrow timestamp range, it is a
> > great acceleration compared to full file scans.
> >
> > Best regards,
> > --
> > HeteroDB, Inc / The PG-Strom Project
> > KaiGai Kohei 
> >



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Micah Kornfield
>
> I didn’t find any page/documentation on how to do an RFC in the Arrow
> protocol, so can anyone point me to it, or will a PR with an email be enough?

That is enough to start discussion.  Before formal acceptance and merging
of the PR there need to be Java and C++ implementations for the type
that pass integration tests.  At the time this guideline was instituted,
Java and C++ were considered the "reference" implementations (I think they
still have the most complete integration test coverage).

My understanding is that the current modelling of intervals mimics SQL
standards (e.g. SQL Server [1]).  So it would also be good to step back and
understand what problem DF is trying to solve and how it differs from other
SQL implementations.  I'd be hesitant to accept COMPLEX as a new type
without a much deeper analysis into calendar representations within Arrow
and how they relate to other existing systems (e.g. Hive and some
assortment of existing SQL databases).  For instance the current modelling
of timestamps does not lend itself to constructing a COMPLEX interval type
particularly well. (Duration was introduced for this reason).

I think both Wes's suggestion of FixedSizeBinary and Andrew's of composing
the two with a struct are good stop-gaps.  These obviously have different
trade-offs.  Ultimately, it would be good to define common extension types
that can represent this use-case if there really is demand for it (if it
doesn't become a top-level type).

[1]
https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/interval-data-types?view=sql-server-ver15
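
A minimal pyarrow sketch of the struct stop-gap (the field names are an
assumption, not an adopted Arrow type):

import pyarrow as pa

interval_type = pa.struct([
    pa.field("months", pa.int32()),
    pa.field("days", pa.int32()),
    pa.field("millis", pa.int32()),
])
# 1 month, 1 day, 1 hour, kept as separate calendar-aware components.
intervals = pa.array([{"months": 1, "days": 1, "millis": 3600000}],
                     type=interval_type)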

-Micah

On Wed, Feb 17, 2021 at 2:05 PM Andrew Lamb  wrote:

> That is a great suggestion Wes, thank you.
>
> I wonder if we could get away with a 128 bit representation that is the
> concatenation of the two existing interval types (YearMonth)(DayTime). Or
> maybe even define a `struct` type with those fields that is used by
> DataFusion.
>
> Basically, given our reading of the Arrow spec[1], it is currently not
> possible to precisely represent an interval that has both monthly and
> sub-monthly granularity.
>
> As Dmtry says, if you have an interval as seemingly simple as 1 month, 1
> day
>
> Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
> Using IntervalUnit(DAY_TIME) can't represent the month as different months
> have different numbers of days
>
> [1]
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
>
>
> On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney  wrote:
>
> > On Wed, Feb 17, 2021 at 3:46 PM  wrote:
> > >
> > > > It's unclear to me that this needs to be introduced into the top-level
> > >
> > > The same applies to the columnar format: how do you store an interval
> > > like 1 month 1 day 1 hour? It’s not possible without converting 1 month
> > > to 30 days, which is a bad approach.
> > >
> >
> > Presumably you can represent a complex interval in a fixed number of
> > bytes, and then embed the data in a FixedSizeBinary type. You can
> > adorn this type with extension type metadata so that DataFusion can
> > then apply Interval semantics to it. This could also serve as an
> > interim strategy for you to proceed with implementation while
> > proposing a top-level type to the Arrow format (which may or may not
> > be accepted) so you aren't blocked on acceptance of changes into
> > Schema.fbs.
> >
> > > > On 17 Feb 2021, at 21:02, Wes McKinney  wrote:
> > > >
> > > > It's unclear to me that this needs to be introduced into the top-level
> > > > columnar format without more analysis — have you considered
> > > > implementing this for DataFusion as an extension type for the time
> > > > being?
> > > >
> > > > On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> For now, there are only two types of IntervalUnit inside Arrow:
> > > >>
> > > >> - YearMonth - month stored as int32
> > > >> - DayTime - days as int32 and time in milliseconds as int32. Total
> > > >> (64 bits)
> > > >>
> > > >> Since DF is using Arrow, it’s not possible to store “complex”
> > > >> intervals such as 1 MONTH 1 DAY 1 HOUR.
> > > >> I think the best way to understand the problem is to read a comment
> > > >> from the DF codebase:
> >
> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
> > > >>
> > > >>    // Interval is tricky thing
> > > >>    // 1 day is not 24 hours because timezones, 1 year != 365/364!
> > > >>    // 30 days != 1 month
> > > >>    // The true way to store and calculate intervals is to store
> > > >>    // it as it defined
> > > >>    // Due the fact that Arrow supports only two types YearMonth
> > > >>    // (month) and DayTime (day, time)
> > > >>    // It's not possible to store complex intervals
> > > >>    // It's possible to do select (NOW() + INTERVAL '1 year') +
> > > >>    // INTERVAL '1 day'; as workaround
> > > >>    if result_month != 0 && (result_days != 0 || result_millis !=
> > > >>    0) {

Re: Cross-endianness IPC support in Arrow C++

2021-02-17 Thread Micah Kornfield
Congrats!

On Wed, Feb 17, 2021 at 4:12 PM Wes McKinney  wrote:

> This is great news! Congrats to everyone who worked on this to make it
> possible. I know that the cross-endianness question was something that
> came up periodically (even though BE systems are increasingly exotic
> nowadays) so it's great that we now have a robust answer
>
> On Wed, Feb 17, 2021 at 8:48 AM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > I would like to announce that we have just merged
> > https://github.com/apache/arrow/pull/7507, which implements - on the C++
> > side - endianness conversion when reading IPC data with non-native
> > endianness.
> >
> > This means that IPC and Flight communication using Arrow C++ should be
> > possible between systems with different native endiannesses.  This
> > feature is experimental for now: there are basic integration tests, but
> > it should be exercised a bit more before we declare it stable.
> >
> > We've also added an entry in the feature compatibility matrix:
> >
> https://github.com/apache/arrow/blob/master/docs/source/status.rst#ipc-format
> >
> > Regards
> >
> > Antoine.
>


Re: Any standard way for min/max values per record-batch?

2021-02-17 Thread Micah Kornfield
There is key-value metadata available on Message which might be able to
work in the short term (some sort of encoded message).  I think
standardizing how we store statistics per batch does make sense.

We unfortunately can't add anything to field-node without breaking
compatibility.  But  another option would be to add a new structure as a
parallel list on RecordBatch itself.

If we do add a new structure or arbitrary key-value pair we should not use
KeyValue but should have something where the values can be bytes.
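
To make the short-term option concrete, here is a pyarrow sketch that ships
min/max as schema-level key-value metadata through an IPC stream (pyarrow does
not expose per-Message custom metadata as far as I know, so this only
approximates the idea; the key names are made up):

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 5, 9])], names=["ts"])
batch = batch.replace_schema_metadata({"ts_min": "1", "ts_max": "9"})

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

reader = pa.ipc.open_stream(sink.getvalue())
print(reader.schema.metadata)  # {b'ts_min': b'1', b'ts_max': b'9'}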

On Wed, Feb 17, 2021 at 7:17 PM Kohei KaiGai  wrote:

> Hello,
>
> Does Apache Arrow have any standard way to embed min/max values of the
> fields
> on a per-record-batch basis?
> It looks like FieldNode supports neither a dedicated min/max attribute nor
> custom-metadata.
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L28
>
> If we embed an array of min/max values into the custom-metadata of the
> Field-node,
> we may be able to implement it.
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344
>
> What I would like to implement is something like the BRIN index in PostgreSQL.
> http://heterodb.github.io/pg-strom/brin/
>
> This index contains only min/max values for particular block ranges, and
> the query executor can skip blocks that obviously don't contain the target
> data. If we can skip 9,990 of 10,000 record batches by checking metadata on
> a query that tries to fetch items in a very narrow timestamp range, it is a
> great acceleration compared to full file scans.
>
> Best regards,
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei 
>


Any standard way for min/max values per record-batch?

2021-02-17 Thread Kohei KaiGai
Hello,

Does Apache Arrow have any standard way to embed min/max values of the fields
on a per-record-batch basis?
It looks like FieldNode supports neither a dedicated min/max attribute nor
custom-metadata.
https://github.com/apache/arrow/blob/master/format/Message.fbs#L28

If we embed an array of min/max values into the custom-metadata of the
Field-node,
we may be able to implement it.
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L344

What I would like to implement is something like the BRIN index in PostgreSQL.
http://heterodb.github.io/pg-strom/brin/

This index contains only min/max values for particular block ranges, and the
query executor can skip blocks that obviously don't contain the target data.
If we can skip 9,990 of 10,000 record batches by checking metadata on a query
that tries to fetch items in a very narrow timestamp range, it is a great
acceleration compared to full file scans.
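
A rough sketch of the intended skipping (the metadata key and JSON layout
below are hypothetical, just to make the idea concrete):

import json

def scan(batches, ts_lo, ts_hi):
    for batch in batches:
        mm = json.loads(batch.schema.metadata[b"ts_min_max"])
        if mm["max"] < ts_lo or mm["min"] > ts_hi:
            continue  # min/max proves the batch can't match: skip its data
        yield batch  # only now touch the batch's actual values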

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: Cross-endianness IPC support in Arrow C++

2021-02-17 Thread Wes McKinney
This is great news! Congrats to everyone who worked on this to make it
possible. I know that the cross-endianness question was something that
came up periodically (even though BE systems are increasingly exotic
nowadays) so it's great that we now have a robust answer

On Wed, Feb 17, 2021 at 8:48 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> I would like to announce that we have just merged
> https://github.com/apache/arrow/pull/7507, which implements - on the C++
> side - endianness conversion when reading IPC data with non-native
> endianness.
>
> This means that IPC and Flight communication using Arrow C++ should be
> possible between systems with different native endiannesses.  This
> feature is experimental for now: there are basic integration tests, but
> it should be exercised a bit more before we declare it stable.
>
> We've also added an entry in the feature compatibility matrix:
> https://github.com/apache/arrow/blob/master/docs/source/status.rst#ipc-format
>
> Regards
>
> Antoine.


Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Andrew Lamb
That is a great suggestion Wes, thank you.

I wonder if we could get away with a 128 bit representation that is the
concatenation of the two existing interval types (YearMonth)(DayTime). Or
maybe even define a `struct` type with those fields that is used by
DataFusion.

Basically, given our reading of the Arrow spec[1], it is currently not
possible to precisely represent an interval that has both monthly and
sub-monthly granularity.

As Dmtry says, if you have an interval as seemingly simple as 1 month, 1 day

Using IntervalUnit(YEAR_MONTH) can't represent the 1 day
Using IntervalUnit(DAY_TIME) can't represent the month as different months
have different numbers of days

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L249-L260
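
A sketch of what the FixedSizeBinary packing could look like (the 16-byte
layout below is an assumption; extension-type metadata would carry the
interval semantics):

import struct
import pyarrow as pa

def pack_interval(months, days, millis):
    # months:int32 | days:int32 | millis:int32 | 4 bytes of padding
    return struct.pack("<iiii", months, days, millis, 0)

# 1 month, 1 day, 1 hour packed into a fixed_size_binary(16) array.
storage = pa.array([pack_interval(1, 1, 3600000)], type=pa.binary(16))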


On Wed, Feb 17, 2021 at 5:01 PM Wes McKinney  wrote:

> On Wed, Feb 17, 2021 at 3:46 PM  wrote:
> >
> > > It's unclear to me that this needs to be introduced into the top-level
> >
> > The same applies to the columnar format: how do you store an interval like
> > 1 month 1 day 1 hour? It’s not possible without converting 1 month to 30
> > days, which is a bad approach.
> >
>
> Presumably you can represent a complex interval in a fixed number of
> bytes, and then embed the data in a FixedSizeBinary type. You can
> adorn this type with extension type metadata so that DataFusion can
> then apply Interval semantics to it. This could also serve as an
> interim strategy for you to proceed with implementation while
> proposing a top-level type to the Arrow format (which may or may not
> be accepted) so you aren't blocked on acceptance of changes into
> Schema.fbs.
>
> > > On 17 Feb 2021, at 21:02, Wes McKinney  wrote:
> > >
> > > It's unclear to me that this needs to be introduced into the top-level
> > > columnar format without more analysis — have you considered
> > > implementing this for DataFusion as an extension type for the time
> > > being?
> > >
> > > On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me wrote:
> > >>
> > >> Hi,
> > >>
> > >> For now, there are only two types of IntervalUnit inside Arrow:
> > >>
> > >> - YearMonth - month stored as int32
> > >> - DayTime - days as int32 and time in milliseconds as int32. Total
> > >> (64 bits)
> > >>
> > >> Since DF is using Arrow, it’s not possible to store “complex”
> > >> intervals such as 1 MONTH 1 DAY 1 HOUR.
> > >> I think the best way to understand the problem is to read a comment
> > >> from the DF codebase:
> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
> > >>
> > >>    // Interval is tricky thing
> > >>    // 1 day is not 24 hours because timezones, 1 year != 365/364!
> > >>    // 30 days != 1 month
> > >>    // The true way to store and calculate intervals is to store
> > >>    // it as it defined
> > >>    // Due the fact that Arrow supports only two types YearMonth
> > >>    // (month) and DayTime (day, time)
> > >>    // It's not possible to store complex intervals
> > >>    // It's possible to do select (NOW() + INTERVAL '1 year') +
> > >>    // INTERVAL '1 day'; as workaround
> > >>    if result_month != 0 && (result_days != 0 || result_millis != 0) {
> > >>        return Err(DataFusionError::NotImplemented(format!(
> > >>            "DF does not support intervals that have both a Year/Month
> > >>            part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try
> > >>            breaking the interval into two parts, one with Year/Month and
> > >>            the other with Days/Hours/Mins/Seconds - e.g. (NOW() +
> > >>            INTERVAL '1 year') + INTERVAL '1 day'",
> > >>            value
> > >>        )));
> > >>    }
> > >>
> > >>
> > >>
> > >> I prepared a PR https://github.com/apache/arrow/pull/9516/files that
> > >> introduces a new type for IntervalUnit called Complex, storing both
> > >> YearMonth and DayTime to support complex intervals.
> > >> I didn’t find any page/documentation on how to do an RFC in the Arrow
> > >> protocol, so can anyone point me to it, or will a PR with an email be
> > >> enough?
> > >>
> > >> Thanks.
> >
>


Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Wes McKinney
On Wed, Feb 17, 2021 at 3:46 PM  wrote:
>
> > It's unclear to me that this needs to be introduced into the top-level
>
> The same applies to the columnar format: how do you store an interval like
> 1 month 1 day 1 hour? It’s not possible without converting 1 month to 30
> days, which is a bad approach.
>

Presumably you can represent a complex interval in a fixed number of
bytes, and then embed the data in a FixedSizeBinary type. You can
adorn this type with extension type metadata so that DataFusion can
then apply Interval semantics to it. This could also serve as an
interim strategy for you to proceed with implementation while
proposing a top-level type to the Arrow format (which may or may not
be accepted) so you aren't blocked on acceptance of changes into
Schema.fbs.

> > On 17 Feb 2021, at 21:02, Wes McKinney  wrote:
> >
> > It's unclear to me that this needs to be introduced into the top-level
> > columnar format without more analysis — have you considered
> > implementing this for DataFusion as an extension type for the time
> > being?
> >
> > On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me wrote:
> >>
> >> Hi,
> >>
> >> For now, there are only two types of IntervalUnit inside Arrow:
> >>
> >> - YearMonth - month stored as int32
> >> - DayTime - days as int32 and time in milliseconds as int32. Total (64
> >> bits)
> >>
> >> Since DF is using Arrow, it’s not possible to store “complex” intervals
> >> such as 1 MONTH 1 DAY 1 HOUR.
> >> I think the best way to understand the problem is to read a comment
> >> from the DF codebase:
> >> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
> >>
> >>// Interval is tricky thing
> >>// 1 day is not 24 hours because timezones, 1 year != 365/364! 30 
> >> days != 1 month
> >>// The true way to store and calculate intervals is to store it as 
> >> it defined
> >>// Due the fact that Arrow supports only two types YearMonth 
> >> (month) and DayTime (day, time)
> >>// It's not possible to store complex intervals
> >>// It's possible to do select (NOW() + INTERVAL '1 year') + 
> >> INTERVAL '1 day'; as workaround
> >>if result_month != 0 && (result_days != 0 || result_millis != 0) {
> >>return Err(DataFusionError::NotImplemented(format!(
> >>"DF does not support intervals that have both a Year/Month 
> >> part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try breaking the 
> >> interval into two parts, one with Year/Month and the other with 
> >> Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') + INTERVAL '1 
> >> day'",
> >>value
> >>)));
> >>}
> >>
> >>
> >>
> >> I prepared a PR https://github.com/apache/arrow/pull/9516/files that
> >> introduces a new type for IntervalUnit called Complex, storing both
> >> YearMonth and DayTime to support complex intervals.
> >> I didn’t find any page/documentation on how to do an RFC in the Arrow
> >> protocol, so can anyone point me to it, or will a PR with an email be
> >> enough?
> >>
> >> Thanks.
>


Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Wes McKinney
Read more about this (one ASF member's interpretation of the Openness
tenet of the Apache Way):

http://theapacheway.com/open/

On Wed, Feb 17, 2021 at 3:46 PM Wes McKinney  wrote:
>
> For trivial PRs that do not merit mention in the changelog you could
> preface the issue title with something like "ARROW-XXX" and we can
> modify the merge tool to bypass the consistency check for these. I
> think some other Apache projects do this. I can understand how it
> might seem like a nuisance to get a Jira when fixing a typo in a
> README, so this is easy to fix.
>
> For contributors doing non-trivial work, I think we want to try to get
> people in the habit of putting out there what they are working on.
> That's the thing that's most consistent with "The Apache Way" — write
> things down, make plans in the open, allow others to see what is going
> on and not have the roadmap existing exclusively in people's minds.
>
> On Wed, Feb 17, 2021 at 3:41 PM Andrew Lamb  wrote:
> >
> > Thanks for the background Wes. This is exactly what I was looking for.
> >
> > I think using JIRA for the single source of truth / project management has
> > lots of value and I don't want to propose changing that. I am trying to
> > lower the barrier to contributing to Arrow even more.
> >
> > While I agree creating JIRA tickets is not hard, it is simply a few more
> > steps for every PR and every contributor. The overhead is that much more if
> > you don't already have a JIRA account -- if I can avoid just a few more
> > steps and get a few more contributors I will consider it a win.
> >
> > Given this info, I will do some research into the technical options, and
> > make a more concrete proposal / prototype for automation in a while.
> >
> > Thanks again,
> > Andrew
> >
> > On Wed, Feb 17, 2021 at 1:28 PM Wes McKinney  wrote:
> >
> > > hi Andrew,
> > >
> > > There isn't a hard requirement. It's a culture thing where the purpose
> > > of Jira issues is to create a changelog and for developers to
> > > communicate publicly what work they are proposing to perform in the
> > > project. We decided by consensus (essentially) that having a single
> > > point of truth for developer activity in the project was a good idea.
> > >
> > > On Wed, Feb 17, 2021 at 12:09 PM Andrew Lamb  wrote:
> > > >
> > > > Can someone tell me / point me at what the actual "requirements" for using
> > > > JIRA in Apache Arrow are?
> > > >
> > > > Specifically, I would like to know:
> > > >
> > > > 1. Where does the requirement for each commit to have a JIRA ticket come
> > > > from? (Is that Apache Arrow specific, or is it a more general Apache
> > > > governance requirement? Something else?)
> > > >
> > > > 2. Does each commit need to be associated with a specific JIRA user
> > > > account, or is a github username sufficient?
> > >
> > > We would prefer that issues be assigned to a Jira user. If you want to
> > > create an issue on behalf of an uncooperative person and assign it to
> > > yourself, you can do that, too.
> > >
> > > > Background:  I am following up on an item raised at the Arrow Sync call
> > > > today and trying to determine how much of the current required Arrow 
> > > > JIRA
> > > > process could be automated. Micah mentioned that the JIRA specifics 
> > > > might
> > > > be related to ASF governance process or requirements, and I am trying to
> > > > research what those are.
> > >
> > > We could easily automate the creation of a Jira issue using a bot of
> > > some kind. I don't think that creating an issue is a hardship, though
> > > (having created thousands of them myself over the last 5 years). My
> > > position is that the hardship exists in the mind of the user and isn't
> > > actually real. It would be better if contributors would indicate the
> > > work they are proposing to contribute to the project before opening a
> > > pull request (so that others know that someone is working on
> > > something), but I understand that not everyone is going to do that.
> > >
> > > > I googled around but could not find anything at the Arrow or ASF level
> > > > about *WHY* Arrow has the current JIRA process requirements (though the
> > > > required process itself is well documented):
> > > >
> > > > Places I looked
> > > > * https://infra.apache.org/policies.html
> > > > *
> > > >
> > > https://arrow.apache.org/docs/developers/contributing.html#report-bugs-and-propose-features
> > > > * http://www.apache.org/licenses/contributor-agreements.html
> > > > * http://www.apache.org/licenses/cla-faq.html
> > > > * various google searches
> > > >
> > > > I apologize if I missed something obvious.
> > > >
> > > > Any help would be most appreciated,
> > > > Andrew
> > >


Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Andrew Lamb
I like the idea of encouraging regular contributors to use JIRA more
consistently

On Wed, Feb 17, 2021 at 4:47 PM Wes McKinney  wrote:

> For trivial PRs that do not merit mention in the changelog you could
> preface the issue title with something like "ARROW-XXX" and we can
> modify the merge tool to bypass the consistency check for these. I
> think some other Apache projects do this. I can understand how it
> might seem like a nuisance to get a Jira when fixing a typo in a
> README, so this is easy to fix.
>
> For contributors doing non-trivial work, I think we want to try to get
> people in the habit of putting out there what they are working on.
> That's the thing that's most consistent with "The Apache Way" — write
> things down, make plans in the open, allow others to see what is going
> on and not have the roadmap existing exclusively in people's minds.
>
> On Wed, Feb 17, 2021 at 3:41 PM Andrew Lamb  wrote:
> >
> > Thanks for the background Wes. This is exactly what I was looking for.
> >
> > I think using JIRA for the single source of truth / project management has
> > lots of value and I don't want to propose changing that. I am trying to
> > lower the barrier to contributing to Arrow even more.
> >
> > While I agree creating JIRA tickets is not hard, it is simply a few more
> > steps for every PR and every contributor. The overhead is that much more if
> > you don't already have a JIRA account -- if I can avoid just a few more
> > steps and get a few more contributors I will consider it a win.
> >
> > Given this info, I will do some research into the technical options, and
> > make a more concrete proposal / prototype for automation in a while.
> >
> > Thanks again,
> > Andrew
> >
> > On Wed, Feb 17, 2021 at 1:28 PM Wes McKinney  wrote:
> >
> > > hi Andrew,
> > >
> > > There isn't a hard requirement. It's a culture thing where the purpose
> > > of Jira issues is to create a changelog and for developers to
> > > communicate publicly what work they are proposing to perform in the
> > > project. We decided by consensus (essentially) that having a single
> > > point of truth for developer activity in the project was a good idea.
> > >
> > > On Wed, Feb 17, 2021 at 12:09 PM Andrew Lamb  wrote:
> > > >
> > > > Can someone tell me / point me at what the actual "requirements" for using
> > > > JIRA in Apache Arrow are?
> > > >
> > > > Specifically, I would like to know:
> > > >
> > > > 1. Where does the requirement for each commit to have a JIRA ticket come
> > > > from? (Is that Apache Arrow specific, or is it a more general Apache
> > > > governance requirement? Something else?)
> > > >
> > > > 2. Does each commit need to be associated with a specific JIRA user
> > > > account, or is a github username sufficient?
> > >
> > > We would prefer that issues be assigned to a Jira user. If you want to
> > > create an issue on behalf of an uncooperative person and assign it to
> > > yourself, you can do that, too.
> > >
> > > > Background:  I am following up on an item raised at the Arrow Sync call
> > > > today and trying to determine how much of the current required Arrow JIRA
> > > > process could be automated. Micah mentioned that the JIRA specifics might
> > > > be related to ASF governance process or requirements, and I am trying to
> > > > research what those are.
> > >
> > > We could easily automate the creation of a Jira issue using a bot of
> > > some kind. I don't think that creating an issue is a hardship, though
> > > (having created thousands of them myself over the last 5 years). My
> > > position is that the hardship exists in the mind of the user and isn't
> > > actually real. It would be better if contributors would indicate the
> > > work they are proposing to contribute to the project before opening a
> > > pull request (so that others know that someone is working on
> > > something), but I understand that not everyone is going to do that.
> > >
> > > > I googled around but could not find anything at the Arrow or ASF level
> > > > about *WHY* Arrow has the current JIRA process requirements (though the
> > > > required process itself is well documented):
> > > >
> > > > Places I looked
> > > > * https://infra.apache.org/policies.html
> > > > *
> > > >
> > >
> https://arrow.apache.org/docs/developers/contributing.html#report-bugs-and-propose-features
> > > > * http://www.apache.org/licenses/contributor-agreements.html
> > > > * http://www.apache.org/licenses/cla-faq.html
> > > > * various google searches
> > > >
> > > > I apologize if I missed something obvious.
> > > >
> > > > Any help would be most appreciated,
> > > > Andrew
> > >
>


Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Wes McKinney
For trivial PRs that do not merit mention in the changelog you could
preface the issue title with something like "ARROW-XXX" and we can
modify the merge tool to bypass the consistency check for these. I
think some other Apache projects do this. I can understand how it
might seem like a nuisance to get a Jira when fixing a typo in a
README, so this is easy to fix.

For contributors doing non-trivial work, I think we want to try to get
people in the habit of putting out there what they are working on.
That's the thing that's most consistent with "The Apache Way" — write
things down, make plans in the open, allow others to see what is going
on and not have the roadmap existing exclusively in people's minds.

On Wed, Feb 17, 2021 at 3:41 PM Andrew Lamb  wrote:
>
> Thanks for the background Wes. This is exactly what I was looking for.
>
> I think using JIRA for the single source of truth / project management has
> lots of value and I don't want to propose changing that. I am trying to
> lower the barrier to contributing to Arrow even more.
>
> While I agree creating JIRA tickets is not hard, it is simply a few more
> steps for every PR and every contributor. The overhead is that much more if
> you don't already have a JIRA account -- if I can avoid just a few more
> steps and get a few more contributors I will consider it a win.
>
> Given this info, I will do some research into the technical options, and
> make a more concrete proposal / prototype for automation in a while.
>
> Thanks again,
> Andrew
>
> On Wed, Feb 17, 2021 at 1:28 PM Wes McKinney  wrote:
>
> > hi Andrew,
> >
> > There isn't a hard requirement. It's a culture thing where the purpose
> > of Jira issues is to create a changelog and for developers to
> > communicate publicly what work they are proposing to perform in the
> > project. We decided by consensus (essentially) that having a single
> > point of truth for developer activity in the project was a good idea.
> >
> > On Wed, Feb 17, 2021 at 12:09 PM Andrew Lamb  wrote:
> > >
> > > Can someone tell me / point me at what the actual "requirements" for using
> > > JIRA in Apache Arrow are?
> > >
> > > Specifically, I would like to know:
> > >
> > > 1. Where does the requirement for each commit to have a JIRA ticket come
> > > from? (Is that Apache Arrow specific, or is it a more general Apache
> > > governance requirement? Something else?)
> > >
> > > 2. Does each commit need to be associated with a specific JIRA user
> > > account, or is a github username sufficient?
> >
> > We would prefer that issues be assigned to a Jira user. If you want to
> > create an issue on behalf of an uncooperative person and assign it to
> > yourself, you can do that, too.
> >
> > > Background:  I am following up on an item raised at the Arrow Sync call
> > > today and trying to determine how much of the current required Arrow JIRA
> > > process could be automated. Micah mentioned that the JIRA specifics might
> > > be related to ASF governance process or requirements, and I am trying to
> > > research what those are.
> >
> > We could easily automate the creation of a Jira issue using a bot of
> > some kind. I don't think that creating an issue is a hardship, though
> > (having created thousands of them myself over the last 5 years). My
> > position is that the hardship exists in the mind of the user and isn't
> > actually real. It would be better if contributors would indicate the
> > work they are proposing to contribute to the project before opening a
> > pull request (so that others know that someone is working on
> > something), but I understand that not everyone is going to do that.
> >
> > > I googled around but could not find anything at the Arrow or ASF level
> > > about *WHY* Arrow has the current JIRA process requirements (though the
> > > required process itself is well documented):
> > >
> > > Places I looked
> > > * https://infra.apache.org/policies.html
> > > *
> > >
> > https://arrow.apache.org/docs/developers/contributing.html#report-bugs-and-propose-features
> > > * http://www.apache.org/licenses/contributor-agreements.html
> > > * http://www.apache.org/licenses/cla-faq.html
> > > * various google searches
> > >
> > > I apologize if I missed something obvious.
> > >
> > > Any help would be most appreciated,
> > > Andrew
> >


Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread talk
> It's unclear to me that this needs to be introduced into the top-level

The same applies to the columnar format: how do you store an interval like
1 month 1 day 1 hour? It’s not possible without converting 1 month to 30 days,
which is a bad approach.

> On 17 Feb 2021, at 21:02, Wes McKinney  wrote:
> 
> It's unclear to me that this needs to be introduced into the top-level
> columnar format without more analysis — have you considered
> implementing this for DataFusion as an extension type for the time
> being?
> 
> > On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me wrote:
>> 
>> Hi,
>> 
>> For now, there are only two types of IntervalUnit inside Arrow:
>> 
>> - YearMonth - month stored as int32
>> - DayTime - days as int32 and time in milliseconds as int32. Total (64 bits)
>> 
>> Since DF is using Arrow, it’s not possible to store “complex” intervals
>> such as 1 MONTH 1 DAY 1 HOUR.
>> I think the best way to understand the problem is to read a comment
>> from the DF codebase:
>> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
>> 
>>// Interval is tricky thing
>>// 1 day is not 24 hours because timezones, 1 year != 365/364! 30 
>> days != 1 month
>>// The true way to store and calculate intervals is to store it as it 
>> defined
>>// Due the fact that Arrow supports only two types YearMonth (month) 
>> and DayTime (day, time)
>>// It's not possible to store complex intervals
>>// It's possible to do select (NOW() + INTERVAL '1 year') + INTERVAL 
>> '1 day'; as workaround
>>if result_month != 0 && (result_days != 0 || result_millis != 0) {
>>return Err(DataFusionError::NotImplemented(format!(
>>"DF does not support intervals that have both a Year/Month 
>> part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try breaking the 
>> interval into two parts, one with Year/Month and the other with 
>> Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') + INTERVAL '1 
>> day'",
>>value
>>)));
>>}
>> 
>> 
>> 
>> I prepared a PR https://github.com/apache/arrow/pull/9516/files that
>> introduces a new type for IntervalUnit called Complex, storing both
>> YearMonth and DayTime to support complex intervals.
>> I didn’t find any page/documentation on how to do an RFC in the Arrow
>> protocol, so can anyone point me to it, or will a PR with an email be
>> enough?
>> 
>> Thanks.



Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Andrew Lamb
Thanks for the background Wes. This is exactly what I was looking for.

I think using JIRA for the single source of truth / project management has
lots of value and I don't want to propose changing that. I am trying to
lower the barrier to contributing to Arrow even more.

While I agree creating JIRA tickets is not hard, it is simply a few more
steps for every PR and every contributor. The overhead is that much more if
you don't already have a JIRA account -- if I can avoid just a few more
steps and get a few more contributors I will consider it a win.

Given this info, I will do some research into the technical options, and
make a more concrete proposal / prototype for automation in a while.

Thanks again,
Andrew

On Wed, Feb 17, 2021 at 1:28 PM Wes McKinney  wrote:

> hi Andrew,
>
> There isn't a hard requirement. It's a culture thing where the purpose
> of Jira issues is to create a changelog and for developers to
> communicate publicly what work they are proposing to perform in the
> project. We decided by consensus (essentially) that having a single
> point of truth for developer activity in the project was a good idea.
>
> On Wed, Feb 17, 2021 at 12:09 PM Andrew Lamb  wrote:
> >
> > Can someone tell me / point me at what the actual "requirements" for using
> > JIRA in Apache Arrow are?
> >
> > Specifically, I would like to know:
> >
> > 1. Where does the requirement for each commit to have a JIRA ticket come
> > from? (Is that Apache Arrow specific, or is it a more general Apache
> > governance requirement? Something else?)
> >
> > 2. Does each commit need to be associated with a specific JIRA user
> > account, or is a github username sufficient?
>
> We would prefer that issues be assigned to a Jira user. If you want to
> create an issue on behalf of an uncooperative person and assign it to
> yourself, you can do that, too.
>
> > Background:  I am following up on an item raised at the Arrow Sync call
> > today and trying to determine how much of the current required Arrow JIRA
> > process could be automated. Micah mentioned that the JIRA specifics might
> > be related to ASF governance process or requirements, and I am trying to
> > research what those are.
>
> We could easily automate the creation of a Jira issue using a bot of
> some kind. I don't think that creating an issue is a hardship, though
> (having created thousands of them myself over the last 5 years). My
> position is that the hardship exists in the mind of the user and isn't
> actually real. It would be better if contributors would indicate the
> work they are proposing to contribute to the project before opening a
> pull request (so that others know that someone is working on
> something), but I understand that not everyone is going to do that.
>
> > I googled around but could not find anything at the Arrow or ASF level
> > about *WHY* Arrow has the current JIRA process requirements (though the
> > required process itself is well documented):
> >
> > Places I looked
> > * https://infra.apache.org/policies.html
> > *
> >
> https://arrow.apache.org/docs/developers/contributing.html#report-bugs-and-propose-features
> > * http://www.apache.org/licenses/contributor-agreements.html
> > * http://www.apache.org/licenses/cla-faq.html
> > * various google searches
> >
> > I apologize if I missed something obvious.
> >
> > Any help would be most appreciated,
> > Andrew
>


Re: Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Wes McKinney
hi Andrew,

There isn't a hard requirement. It's a culture thing where the purpose
of Jira issues is to create a changelog and for developers to
communicate publicly what work they are proposing to perform in the
project. We decided by consensus (essentially) that having a single
point of truth for developer activity in the project was a good idea.

On Wed, Feb 17, 2021 at 12:09 PM Andrew Lamb  wrote:
>
> Can someone tell me / point me at what the actual "requirements" for using
> JIRA in Apache Arrow are?
>
> Specifically, I would like to know:
>
> 1. Where does the requirement for each commit to have a JIRA ticket come
> from? (Is that Apache Arrow specific, or is it a more general Apache
> governance requirement? Something else?)
>
> 2. Does each commit need to be associated with a specific JIRA user
> account, or is a github username sufficient?

We would prefer that issues be assigned to a Jira user. If you want to
create an issue on behalf of an uncooperative person and assign it to
yourself, you can do that, too.

> Background:  I am following up on an item raised at the Arrow Sync call
> today and trying to determine how much of the current required Arrow JIRA
> process could be automated. Micah mentioned that the JIRA specifics might
> be related to ASF governance process or requirements, and I am trying to
> research what those are.

We could easily automate the creation of a Jira issue using a bot of
some kind. I don't think that creating an issue is a hardship, though
(having created thousands of them myself over the last 5 years). My
position is that the hardship exists in the mind of the user and isn't
actually real. It would be better if contributors would indicate the
work they are proposing to contribute to the project before opening a
pull request (so that others know that someone is working on
something), but I understand that not everyone is going to do that.

> I googled around but could not find anything at the Arrow or ASF level
> about *WHY* Arrow has the current JIRA process requirements (though the
> required process itself is well documented):
>
> Places I looked
> * https://infra.apache.org/policies.html
> *
> https://arrow.apache.org/docs/developers/contributing.html#report-bugs-and-propose-features
> * http://www.apache.org/licenses/contributor-agreements.html
> * http://www.apache.org/licenses/cla-faq.html
> * various google searches
>
> I apologize if I missed something obvious.
>
> Any help would be most appreciated,
> Andrew


Requirements on JIRA usage in Apache Arrow

2021-02-17 Thread Andrew Lamb
Can someone tell me / point me at what the actual "requirements" for using
JIRA in Apache Arrow are?

Specifically, I would like to know:

1. Where does the requirement for each commit to have a JIRA ticket come
from? (Is that Apache Arrow specific, or is it a more general Apache
governance requirement? Something else?)

2. Does each commit need to be associated with a specific JIRA user
account, or is a github username sufficient?

Background:  I am following up on an item raised at the Arrow Sync call
today and trying to determine how much of the current required Arrow JIRA
process could be automated. Micah mentioned that the JIRA specifics might
be related to ASF governance process or requirements, and I am trying to
research what those are.

I googled around but could not find anything at the Arrow or ASF level
about *WHY* Arrow has the current JIRA process requirements (though the
required process itself is well documented):

Places I looked
* https://infra.apache.org/policies.html
*
https://arrow.apache.org/docs/developers/contributing.html#report-bugs-and-propose-features
* http://www.apache.org/licenses/contributor-agreements.html
* http://www.apache.org/licenses/cla-faq.html
* various google searches

I apologize if I missed something obvious.

Any help would be most appreciated,
Andrew


Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread Wes McKinney
It's unclear to me that this needs to be introduced into the top-level
columnar format without more analysis — have you considered
implementing this for DataFusion as an extension type for the time
being?

On Wed, Feb 17, 2021 at 11:59 AM t...@dmtry.me  wrote:
>
> Hi,
>
> For now, there are only two types of IntervalUnit inside Arrow:
>
> - YearMonth - month stored as int32
> - DayTime - days as int32 and time in milliseconds as int32. Total (64 bits)
>
> Since DF is using Arrow, it’s not possible to store “complex” intervals
> such as 1 MONTH 1 DAY 1 HOUR.
> I think the best way to understand the problem is to read a comment
> from the DF codebase:
> https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148
>
> // Interval is tricky thing
> // 1 day is not 24 hours because timezones, 1 year != 365/364! 30 
> days != 1 month
> // The true way to store and calculate intervals is to store it as it 
> defined
> // Due the fact that Arrow supports only two types YearMonth (month) 
> and DayTime (day, time)
> // It's not possible to store complex intervals
> // It's possible to do select (NOW() + INTERVAL '1 year') + INTERVAL 
> '1 day'; as workaround
> if result_month != 0 && (result_days != 0 || result_millis != 0) {
> return Err(DataFusionError::NotImplemented(format!(
> "DF does not support intervals that have both a Year/Month 
> part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try breaking the 
> interval into two parts, one with Year/Month and the other with 
> Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') + INTERVAL '1 
> day'",
> value
> )));
> }
>
>
>
> I prepared a PR https://github.com/apache/arrow/pull/9516/files that
> introduces a new type for IntervalUnit called Complex, storing both
> YearMonth and DayTime to support complex intervals.
> I didn’t find any page/documentation on how to do an RFC in the Arrow
> protocol, so can anyone point me to it, or will a PR with an email be enough?
>
> Thanks.


[Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-02-17 Thread t...@dmtry.me
Hi,

For now, there are only two types of IntervalUnit inside Arrow:

- YearMonth - month stored as int32
- DayTime - days as int32 and time in milliseconds as int32 (64 bits total)

Since DF is using Arrow, it’s not possible to store “complex” intervals such
as 1 MONTH 1 DAY 1 HOUR.
I think the best way to understand the problem is to read this comment from
the DF codebase:
https://github.com/apache/arrow/blob/bca7d2fe84ccd8fc1129cb4d85448eb0779c52c3/rust/datafusion/src/sql/planner.rs#L1148

// Interval is tricky thing
// 1 day is not 24 hours because timezones, 1 year != 365/364! 30 days != 1 month
// The true way to store and calculate intervals is to store it as it defined
// Due the fact that Arrow supports only two types YearMonth (month) and DayTime (day, time)
// It's not possible to store complex intervals
// It's possible to do select (NOW() + INTERVAL '1 year') + INTERVAL '1 day'; as workaround
if result_month != 0 && (result_days != 0 || result_millis != 0) {
    return Err(DataFusionError::NotImplemented(format!(
        "DF does not support intervals that have both a Year/Month part as well as Days/Hours/Mins/Seconds: {:?}. Hint: try breaking the interval into two parts, one with Year/Month and the other with Days/Hours/Mins/Seconds - e.g. (NOW() + INTERVAL '1 year') + INTERVAL '1 day'",
        value
    )));
}



I prepared a PR (https://github.com/apache/arrow/pull/9516/files) that
introduces a new type for IntervalUnit called Complex, which stores both
YearMonth and DayTime to support complex intervals.
I didn’t find any page/documentation on how to make an RFC for the Arrow
protocol, so can anyone point me to it, or is a PR with an email enough?

Thanks.
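
As an illustration of the idea, here is a minimal Rust sketch (hypothetical
names, not the actual layout proposed in the PR above). Keeping the month,
day, and millisecond components side by side means an interval like
1 MONTH 1 DAY 1 HOUR is stored exactly, because no lossy unit conversion is
ever applied:

#[derive(Debug, Clone, Copy)]
struct ComplexInterval {
    months: i32, // YearMonth component
    days: i32,   // DayTime component: whole days
    millis: i32, // DayTime component: milliseconds within a day
}

impl ComplexInterval {
    // Interval arithmetic stays component-wise; no unit conversion happens.
    fn add(self, other: ComplexInterval) -> ComplexInterval {
        ComplexInterval {
            months: self.months + other.months,
            days: self.days + other.days,
            millis: self.millis + other.millis,
        }
    }
}

fn main() {
    // INTERVAL '1 month' + INTERVAL '1 day 1 hour' is representable exactly:
    let a = ComplexInterval { months: 1, days: 0, millis: 0 };
    let b = ComplexInterval { months: 0, days: 1, millis: 3_600_000 };
    println!("{:?}", a.add(b));
}

Conversion to a concrete duration would happen only when such an interval is
applied to an actual timestamp, where calendar and timezone rules are known.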

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Gidon Gershinsky
This certainly sounds good to me.

Cheers, Gidon


On Wed, Feb 17, 2021 at 7:36 PM Antoine Pitrou  wrote:

>
> I don't think any notion of threading should be present in the
> implementation, except for the required locks around shared structures.
>  I don't know where the idea of a "main thread" comes from, but it
> probably shouldn't exist in a C++ library.
>
> Regards
>
> Antoine.
>
>
>
> Le 17/02/2021 à 18:34, Gidon Gershinsky a écrit :
> > Just to clarify. There are two options, which one do you refer to? A
> design
> > with a main thread that handles projections and the keys (relevant for
> the
> > projected columns); or the current code with any thread allowed to handle
> > full file reading, inc the footer, column projections and their keys? Can
> > you finalize this with Micah?
> > The good news is, Tham is still interested to resume this work, and is ok
> > with either option. Please let her know whether the current threading
> model
> > stays, or should be modified with the changes proposed in the doc (for
> the
> > latter, some guidance with the details would be needed).
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou 
> wrote:
> >
> >>
> >>
> >> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> >>> From the doc,
> >>> "To maintain consistency with the style of parquet-cpp, the above
> >>> structures should not be explicitly synchronized with individual
> mutexes.
> >>> In the case of a parquet::arrow::FileReader, the request to read a
> given
> >>> selection of row groups and columns is issued from a single main
> thread.
> >>> Note that this does require that all keys required for a read are
> >> assembled
> >>> on the main thread so that DecryptionKeyRetriever objects are not
> >> directly
> >>> accessing any caches"
> >>>
> >>> The current PR code doesn't require a single main thread. Any thread
> can
> >>> read any file, both footer and pages. So the key cache is shared, to
> save
> >>> N-1 interactions with the KMS server.
> >>
> >> I don't think there's any contention on this.  IMHO the only concerns
> >> are about the implementation, not the semantics.
> >>
> >> Best regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Cheers, Gidon
> >>>
> >>>
> >>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou 
> >> wrote:
> >>>
> 
>  I'm not sure a threading model is expected for an encryption layer.
> Am
>  I missing something?
> 
>  Regards
> 
>  Antoine.
> 
> 
>  Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> > Precisely, the main change is in the threading model. Afaik, the
> >> document
> > proposes a model that fits pandas, but might be problematic for other
>  users
> > of this library.
> > Technically, this is not showstopper though; if the community decides
> >> on
> > this model, it will be compatible with the high-level encryption
> >> design;
> > but the change implementation would need to be done by pandas experts
>  (not
> > us; but we'll help where we can).
> > Micah, you know this subject (and the community) better than we do -
> >> we'd
> > much appreciate it if you'd take a lead on removing this roadblock.
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
> > wrote:
> >
> >> I think some of the comments might be conflicting.  One of the
> >> concerns
> >> (that I would need to refresh myself on to offer an opinion which
> was
> >> covered in Ben's doc) was the threading model we expect in the
> >> library.
> >>
> >> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou 
>  wrote:
> >>
> >>>
> >>> Hi Gidon,
> >>>
> >>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
>  Regarding the high-level layer, I think it waits for a progress at
> 
> >>>
> >>
> 
> >>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
>  No activity there since last November. This is unfortunate,
> because
> >> Tham
>  has put a lot of work in coding the high-level layer (and
> addressing
> >> 200+
>  review comments) in the PR
> >> https://github.com/apache/arrow/pull/8023.
> >>> The
>  code is functional, compatible with the Java version in
> parquet-mr,
>  and
> >>> can
>  be updated with the threading changes in the doc above. I hope all
>  this
>  good work will not be wasted.
> >>>
> >>> I'm sorry for the possibly frustrating process.  Looking at the PR,
> >>> though, it seems a bunch of comments were not addressed.  Is it
>  possible
> >>> to go through them and ensure they get an answer and/or a
> resolution?
> >>>
> >>> Best regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> 
>  Cheers, Gidon
> 
> 
>  On Sat, Feb 13, 2021 at 

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Antoine Pitrou


I don't think any notion of threading should be present in the
implementation, except for the required locks around shared structures.
 I don't know where the idea of a "main thread" comes from, but it
probably shouldn't exist in a C++ library.

Regards

Antoine.



On 17/02/2021 at 18:34, Gidon Gershinsky wrote:
> Just to clarify. There are two options, which one do you refer to? A design
> with a main thread that handles projections and the keys (relevant for the
> projected columns); or the current code with any thread allowed to handle
> full file reading, inc the footer, column projections and their keys? Can
> you finalize this with Micah?
> The good news is, Tham is still interested to resume this work, and is ok
> with either option. Please let her know whether the current threading model
> stays, or should be modified with the changes proposed in the doc (for the
> latter, some guidance with the details would be needed).
> 
> Cheers, Gidon
> 
> 
> On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou  wrote:
> 
>>
>>
>> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
>>> From the doc,
>>> "To maintain consistency with the style of parquet-cpp, the above
>>> structures should not be explicitly synchronized with individual mutexes.
>>> In the case of a parquet::arrow::FileReader, the request to read a given
>>> selection of row groups and columns is issued from a single main thread.
>>> Note that this does require that all keys required for a read are
>> assembled
>>> on the main thread so that DecryptionKeyRetriever objects are not
>> directly
>>> accessing any caches"
>>>
>>> The current PR code doesn't require a single main thread. Any thread can
>>> read any file, both footer and pages. So the key cache is shared, to save
>>> N-1 interactions with the KMS server.
>>
>> I don't think there's any contention on this.  IMHO the only concerns
>> are about the implementation, not the semantics.
>>
>> Best regards
>>
>> Antoine.
>>
>>
>>>
>>> Cheers, Gidon
>>>
>>>
>>> On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou 
>> wrote:
>>>

 I'm not sure a threading model is expected for an encryption layer.  Am
 I missing something?

 Regards

 Antoine.


 Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> Precisely, the main change is in the threading model. Afaik, the
>> document
> proposes a model that fits pandas, but might be problematic for other
 users
> of this library.
> Technically, this is not showstopper though; if the community decides
>> on
> this model, it will be compatible with the high-level encryption
>> design;
> but the change implementation would need to be done by pandas experts
 (not
> us; but we'll help where we can).
> Micah, you know this subject (and the community) better than we do -
>> we'd
> much appreciate it if you'd take a lead on removing this roadblock.
>
> Cheers, Gidon
>
>
> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield >>
> wrote:
>
>> I think some of the comments might be conflicting.  One of the
>> concerns
>> (that I would need to refresh myself on to offer an opinion which was
>> covered in Ben's doc) was the threading model we expect in the
>> library.
>>
>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou 
 wrote:
>>
>>>
>>> Hi Gidon,
>>>
>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
 Regarding the high-level layer, I think it waits for a progress at

>>>
>>

>> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
 No activity there since last November. This is unfortunate, because
>> Tham
 has put a lot of work in coding the high-level layer (and addressing
>> 200+
 review comments) in the PR
>> https://github.com/apache/arrow/pull/8023.
>>> The
 code is functional, compatible with the Java version in parquet-mr,
 and
>>> can
 be updated with the threading changes in the doc above. I hope all
 this
 good work will not be wasted.
>>>
>>> I'm sorry for the possibly frustrating process.  Looking at the PR,
>>> though, it seems a bunch of comments were not addressed.  Is it
 possible
>>> to go through them and ensure they get an answer and/or a resolution?
>>>
>>> Best regards
>>>
>>> Antoine.
>>>
>>>
>>>

 Cheers, Gidon


 On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
 emkornfi...@gmail.com
>>>
 wrote:

> My thoughts:
> 1.  I've lost track of the higher level encryption implementation
>> in
>>> C++.
> I think we were trying to come to a consensus on the
>> threading/thread
> safety model?
>
> 2.  I'm open to exposing the lower level encryption libraries in
>> python
> (without appropriate 

Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Gidon Gershinsky
Just to clarify. There are two options, which one do you refer to? A design
with a main thread that handles projections and the keys (relevant for the
projected columns); or the current code with any thread allowed to handle
full file reading, inc the footer, column projections and their keys? Can
you finalize this with Micah?
The good news is, Tham is still interested to resume this work, and is ok
with either option. Please let her know whether the current threading model
stays, or should be modified with the changes proposed in the doc (for the
latter, some guidance with the details would be needed).

Cheers, Gidon


On Wed, Feb 17, 2021 at 2:40 PM Antoine Pitrou  wrote:

>
>
> Le 17/02/2021 à 12:47, Gidon Gershinsky a écrit :
> > From the doc,
> > "To maintain consistency with the style of parquet-cpp, the above
> > structures should not be explicitly synchronized with individual mutexes.
> > In the case of a parquet::arrow::FileReader, the request to read a given
> > selection of row groups and columns is issued from a single main thread.
> > Note that this does require that all keys required for a read are
> assembled
> > on the main thread so that DecryptionKeyRetriever objects are not
> directly
> > accessing any caches"
> >
> > The current PR code doesn't require a single main thread. Any thread can
> > read any file, both footer and pages. So the key cache is shared, to save
> > N-1 interactions with the KMS server.
>
> I don't think there's any contention on this.  IMHO the only concerns
> are about the implementation, not the semantics.
>
> Best regards
>
> Antoine.
>
>
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou 
> wrote:
> >
> >>
> >> I'm not sure a threading model is expected for an encryption layer.  Am
> >> I missing something?
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> >>> Precisely, the main change is in the threading model. Afaik, the
> document
> >>> proposes a model that fits pandas, but might be problematic for other
> >> users
> >>> of this library.
> >>> Technically, this is not showstopper though; if the community decides
> on
> >>> this model, it will be compatible with the high-level encryption
> design;
> >>> but the change implementation would need to be done by pandas experts
> >> (not
> >>> us; but we'll help where we can).
> >>> Micah, you know this subject (and the community) better than we do -
> we'd
> >>> much appreciate it if you'd take a lead on removing this roadblock.
> >>>
> >>> Cheers, Gidon
> >>>
> >>>
> >>> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield  >
> >>> wrote:
> >>>
>  I think some of the comments might be conflicting.  One of the
> concerns
>  (that I would need to refresh myself on to offer an opinion which was
>  covered in Ben's doc) was the threading model we expect in the
> library.
> 
>  On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou 
> >> wrote:
> 
> >
> > Hi Gidon,
> >
> > Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
> >> Regarding the high-level layer, I think it waits for a progress at
> >>
> >
> 
> >>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
> >> No activity there since last November. This is unfortunate, because
>  Tham
> >> has put a lot of work in coding the high-level layer (and addressing
>  200+
> >> review comments) in the PR
> https://github.com/apache/arrow/pull/8023.
> > The
> >> code is functional, compatible with the Java version in parquet-mr,
> >> and
> > can
> >> be updated with the threading changes in the doc above. I hope all
> >> this
> >> good work will not be wasted.
> >
> > I'm sorry for the possibly frustrating process.  Looking at the PR,
> > though, it seems a bunch of comments were not addressed.  Is it
> >> possible
> > to go through them and ensure they get an answer and/or a resolution?
> >
> > Best regards
> >
> > Antoine.
> >
> >
> >
> >>
> >> Cheers, Gidon
> >>
> >>
> >> On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> >> emkornfi...@gmail.com
> >
> >> wrote:
> >>
> >>> My thoughts:
> >>> 1.  I've lost track of the higher level encryption implementation
> in
> > C++.
> >>> I think we were trying to come to a consensus on the
> threading/thread
> >>> safety model?
> >>>
> >>> 2.  I'm open to exposing the lower level encryption libraries in
>  python
> >>> (without appropriate namespacing/communication).  It seems at least
>  for
> >>> reading, there is potentially less harm (I'll caveat that with I'm
>  not a
> >>> security expert).  Are both the low level read and write
>  implementations
> >>> necessary?  (it probably makes sense to have a few smaller PRs for
> > exposing
> >>> this functionality anyways).
> 

Re: "2.0.1" and "3.0.1" versions on JIRA

2021-02-17 Thread Wes McKinney
I think 2.0.1 can be removed. I doubt that a 3.0.1 patch release is going
to happen either but it can be removed later.

On Wed, Feb 17, 2021 at 9:41 AM Antoine Pitrou  wrote:

>
> Hi,
>
> There are versions named "2.0.1" and "3.0.1" on JIRA, they are tagged
> with a number of issues:
> https://issues.apache.org/jira/projects/ARROW/versions/12349263
> https://issues.apache.org/jira/projects/ARROW/versions/12349610
>
> What should we do with them? It seems that "2.0.1" at least should be
> closed or removed.
>
> Regards
>
> Antoine.
>


"2.0.1" and "3.0.1" versions on JIRA

2021-02-17 Thread Antoine Pitrou


Hi,

There are versions named "2.0.1" and "3.0.1" on JIRA, they are tagged
with a number of issues:
https://issues.apache.org/jira/projects/ARROW/versions/12349263
https://issues.apache.org/jira/projects/ARROW/versions/12349610

What should we do with them? It seems that "2.0.1" at least should be
closed or removed.

Regards

Antoine.


Cross-endianness IPC support in Arrow C++

2021-02-17 Thread Antoine Pitrou


Hello,

I would like to announce that we have just merged
https://github.com/apache/arrow/pull/7507, which implements - on the C++
side - endianness conversion when reading IPC data with non-native
endianness.

This means that IPC and Flight communication using Arrow C++ should be
possible between systems with different native endiannesses.  This
feature is experimental for now: there are basic integration tests, but
it should be exercised a bit more before we declare it stable.

We've also added an entry in the feature compatibility matrix:
https://github.com/apache/arrow/blob/master/docs/source/status.rst#ipc-format
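
As a rough illustration of what such a conversion involves (a Rust sketch,
not the actual C++ implementation): decoding an int32 buffer that was
written little-endian is a per-value byte swap on big-endian hosts and a
no-op otherwise.

// Convert a buffer of int32 values written little-endian into native order.
// from_le_bytes swaps bytes only when the host is big-endian.
fn int32s_from_le(buf: &[u8]) -> Vec<i32> {
    buf.chunks_exact(4)
        .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() {
    let le_bytes = [1u8, 0, 0, 0, 2, 0, 0, 0]; // [1, 2] encoded little-endian
    assert_eq!(int32s_from_le(&le_bytes), vec![1, 2]);
}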

Regards

Antoine.


Re: Arrow sync call February 17 at 12:00 US/Eastern, 17:00 UTC

2021-02-17 Thread Antoine Pitrou


On 17/02/2021 at 12:07, Andrew Lamb wrote:
> *Proposal*: Allow Rust and other implementations to release additional point
> / maintenance versions at different cadences, out of lockstep with the
> major Arrow releases. We could still release the Rust library as part of
> the major Arrow releases, but we would release both bug fix / patch
> versions as well as intermediate feature releases of our various crates.

No opposition from me.

Best regards

Antoine.


Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Gidon Gershinsky
From the doc,
"To maintain consistency with the style of parquet-cpp, the above
structures should not be explicitly synchronized with individual mutexes.
In the case of a parquet::arrow::FileReader, the request to read a given
selection of row groups and columns is issued from a single main thread.
Note that this does require that all keys required for a read are assembled
on the main thread so that DecryptionKeyRetriever objects are not directly
accessing any caches"

The current PR code doesn't require a single main thread. Any thread can
read any file, both footer and pages. So the key cache is shared, to save
N-1 interactions with the KMS server.

Cheers, Gidon
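
To make the trade-off concrete, here is a minimal sketch of such a shared
key cache (hypothetical types, not the code from the PR): threads take a
lock around the shared map, and only a cache miss triggers a KMS round-trip.

use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical KMS client interface; the retriever in the actual PR differs.
trait Kms {
    fn fetch_key(&self, key_id: &str) -> Vec<u8>;
}

// Shared across reader threads: only the first thread asking for a key id
// pays the KMS round-trip; the other N-1 lookups are served from the cache.
struct KeyCache<K: Kms> {
    kms: K,
    cache: Mutex<HashMap<String, Vec<u8>>>,
}

impl<K: Kms> KeyCache<K> {
    fn get(&self, key_id: &str) -> Vec<u8> {
        // Simplification: a production version would avoid holding the lock
        // across the network call to the KMS.
        let mut cache = self.cache.lock().unwrap();
        cache
            .entry(key_id.to_string())
            .or_insert_with(|| self.kms.fetch_key(key_id))
            .clone()
    }
}

struct DummyKms;
impl Kms for DummyKms {
    fn fetch_key(&self, _key_id: &str) -> Vec<u8> {
        vec![0u8; 16] // stand-in for a real KMS round-trip
    }
}

fn main() {
    let cache = KeyCache { kms: DummyKms, cache: Mutex::new(HashMap::new()) };
    let k1 = cache.get("footer-key");
    let k2 = cache.get("footer-key"); // cache hit, no second KMS call
    assert_eq!(k1, k2);
}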


On Wed, Feb 17, 2021 at 12:49 PM Antoine Pitrou  wrote:

>
> I'm not sure a threading model is expected for an encryption layer.  Am
> I missing something?
>
> Regards
>
> Antoine.
>
>
> Le 17/02/2021 à 06:59, Gidon Gershinsky a écrit :
> > Precisely, the main change is in the threading model. Afaik, the document
> > proposes a model that fits pandas, but might be problematic for other
> users
> > of this library.
> > Technically, this is not showstopper though; if the community decides on
> > this model, it will be compatible with the high-level encryption design;
> > but the change implementation would need to be done by pandas experts
> (not
> > us; but we'll help where we can).
> > Micah, you know this subject (and the community) better than we do - we'd
> > much appreciate it if you'd take a lead on removing this roadblock.
> >
> > Cheers, Gidon
> >
> >
> > On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield 
> > wrote:
> >
> >> I think some of the comments might be conflicting.  One of the concerns
> >> (that I would need to refresh myself on to offer an opinion which was
> >> covered in Ben's doc) was the threading model we expect in the library.
> >>
> >> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou 
> wrote:
> >>
> >>>
> >>> Hi Gidon,
> >>>
> >>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
>  Regarding the high-level layer, I think it waits for a progress at
> 
> >>>
> >>
> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
>  No activity there since last November. This is unfortunate, because
> >> Tham
>  has put a lot of work in coding the high-level layer (and addressing
> >> 200+
>  review comments) in the PR https://github.com/apache/arrow/pull/8023.
> >>> The
>  code is functional, compatible with the Java version in parquet-mr,
> and
> >>> can
>  be updated with the threading changes in the doc above. I hope all
> this
>  good work will not be wasted.
> >>>
> >>> I'm sorry for the possibly frustrating process.  Looking at the PR,
> >>> though, it seems a bunch of comments were not addressed.  Is it
> possible
> >>> to go through them and ensure they get an answer and/or a resolution?
> >>>
> >>> Best regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> 
>  Cheers, Gidon
> 
> 
>  On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
>  wrote:
> 
> > My thoughts:
> > 1.  I've lost track of the higher level encryption implementation in
> >>> C++.
> > I think we were trying to come to a consensus on the threading/thread
> > safety model?
> >
> > 2.  I'm open to exposing the lower level encryption libraries in
> >> python
> > (without appropriate namespacing/communication).  It seems at least
> >> for
> > reading, there is potentially less harm (I'll caveat that with I'm
> >> not a
> > security expert).  Are both the low level read and write
> >> implementations
> > necessary?  (it probably makes sense to have a few smaller PRs for
> >>> exposing
> > this functionality anyways).
> >
> >
> >
> > On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> > ita...@pythonspeed.com> wrote:
> >
> >> Hi,
> >>
> >> Since the PR for high-level C++ Parquet encryption API appears
> >> stalled
> >>> (
> >> https://github.com/apache/arrow/pull/8023), I'm looking into
> >> exposing
> > the
> >> low-level Parquet encryption API to Python.
> >>
> >> Arguments for doing this: the low-level API is all the users I'm
> >>> talking
> >> to need, at the moment, so it's plausible others would also find
> some
> >> benefit in having the Pyarrow API expose low-level Parquet
> >> encryption.
> > Then
> >> again, it might only be this one company and no one else cares.
> >>
> >> The arguments against, per Gidon Gershinsky:
> >>
> >>>  * security: low-level encryption API is easy to misuse (eg giving
> >> the
> >> same keys for a number of different files; this'd break the AES GCM
> >> cipher). The high-level encryption layer handles that by applying
> > envelope
> >> encryption and other best practices in data security. Also, this
> >> layer
> >>> is
> >> maintained by the 

Re: Arrow sync call February 17 at 12:00 US/Eastern, 17:00 UTC

2021-02-17 Thread Andrew Lamb
I have two items I would like to propose for the agenda if there is time:

1. Manual creation of JIRA Tickets
*Background/Issue*: Currently all contributors are required to make a JIRA
account and do some mechanical JIRA creation to create well-formed Arrow
PRs. This is mindless work, and people who are used to it may forget the
barrier it imposes on casual contributions and the burden it places on
maintainers to keep JIRA synced.
*Proposal*: Write a bot / script that uses the title of new PRs to open the
appropriate JIRA issue / components. Devs / maintainers would only have to
ensure that PR titles start with the correct components.
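
A sketch of what the bot's title parsing could look like, in Rust
(hypothetical helper, not an existing Arrow tool, assuming the conventional
"ARROW-1234: [Component] Description" title form):

// Extract the JIRA issue key and component from a PR title.
fn parse_pr_title(title: &str) -> Option<(String, String)> {
    let (issue, rest) = title.split_once(": ")?;
    let num = issue.strip_prefix("ARROW-")?;
    if num.is_empty() || !num.chars().all(|c| c.is_ascii_digit()) {
        return None; // no well-formed JIRA key in the title
    }
    let component = rest.trim_start().strip_prefix('[')?.split_once(']')?.0;
    Some((issue.to_string(), component.to_string()))
}

fn main() {
    assert_eq!(
        parse_pr_title("ARROW-1234: [Rust] Fix reader"),
        Some(("ARROW-1234".to_string(), "Rust".to_string()))
    );
    assert_eq!(parse_pr_title("no ticket here"), None);
}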

2. Slow Release Process + Dearth of Maintenance Releases
*Background / Issues*: Infrequent releases (every three months) and a lack
of maintenance patchsets mean that many users of the Rust Arrow crate need
to pull in some version of the `master` branch, as that is the only way they
can get bug fixes in the intervening three months between releases -- this
causes friction with users, and pressure to keep APIs more compatible than
befits a project at the Rust implementation's stage of maturity.
*Proposal*: Allow Rust and other implementations to release additional point
/ maintenance versions at different cadences, out of lockstep with the
major Arrow releases. We could still release the Rust library as part of
the major Arrow releases, but we would release both bug fix / patch
versions as well as intermediate feature releases of our various crates.

On Tue, Feb 16, 2021 at 10:53 PM Neal Richardson <
neal.p.richard...@gmail.com> wrote:

> Hi all,
> Reminder that our biweekly call is coming up at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be shared with the mailing list afterward.
>
> Neal
>


Re: Exposing low-level Parquet encryption to Python user (or, maybe not)

2021-02-17 Thread Antoine Pitrou


I'm not sure a threading model is expected for an encryption layer.  Am
I missing something?

Regards

Antoine.


On 17/02/2021 at 06:59, Gidon Gershinsky wrote:
> Precisely, the main change is in the threading model. Afaik, the document
> proposes a model that fits pandas, but might be problematic for other users
> of this library.
> Technically, this is not showstopper though; if the community decides on
> this model, it will be compatible with the high-level encryption design;
> but the change implementation would need to be done by pandas experts (not
> us; but we'll help where we can).
> Micah, you know this subject (and the community) better than we do - we'd
> much appreciate it if you'd take a lead on removing this roadblock.
> 
> Cheers, Gidon
> 
> 
> On Wed, Feb 17, 2021 at 6:08 AM Micah Kornfield 
> wrote:
> 
>> I think some of the comments might be conflicting.  One of the concerns
>> (that I would need to refresh myself on to offer an opinion which was
>> covered in Ben's doc) was the threading model we expect in the library.
>>
>> On Tue, Feb 16, 2021 at 8:03 AM Antoine Pitrou  wrote:
>>
>>>
>>> Hi Gidon,
>>>
>>> Le 16/02/2021 à 16:42, Gidon Gershinsky a écrit :
 Regarding the high-level layer, I think it waits for a progress at

>>>
>> https://docs.google.com/document/d/11qz84ajysvVo5ZAV9mXKOeh6ay4-xgkBrubggCP5220/edit?usp=sharing
 No activity there since last November. This is unfortunate, because
>> Tham
 has put a lot of work in coding the high-level layer (and addressing
>> 200+
 review comments) in the PR https://github.com/apache/arrow/pull/8023.
>>> The
 code is functional, compatible with the Java version in parquet-mr, and
>>> can
 be updated with the threading changes in the doc above. I hope all this
 good work will not be wasted.
>>>
>>> I'm sorry for the possibly frustrating process.  Looking at the PR,
>>> though, it seems a bunch of comments were not addressed.  Is it possible
>>> to go through them and ensure they get an answer and/or a resolution?
>>>
>>> Best regards
>>>
>>> Antoine.
>>>
>>>
>>>

 Cheers, Gidon


 On Sat, Feb 13, 2021 at 6:52 AM Micah Kornfield >>
 wrote:

> My thoughts:
> 1.  I've lost track of the higher level encryption implementation in
>>> C++.
> I think we were trying to come to a consensus on the threading/thread
> safety model?
>
> 2.  I'm open to exposing the lower level encryption libraries in
>> python
> (without appropriate namespacing/communication).  It seems at least
>> for
> reading, there is potentially less harm (I'll caveat that with I'm
>> not a
> security expert).  Are both the low level read and write
>> implementations
> necessary?  (it probably makes sense to have a few smaller PRs for
>>> exposing
> this functionality anyways).
>
>
>
> On Wed, Feb 10, 2021 at 7:10 AM Itamar Turner-Trauring <
> ita...@pythonspeed.com> wrote:
>
>> Hi,
>>
>> Since the PR for high-level C++ Parquet encryption API appears
>> stalled
>>> (
>> https://github.com/apache/arrow/pull/8023), I'm looking into
>> exposing
> the
>> low-level Parquet encryption API to Python.
>>
>> Arguments for doing this: the low-level API is all the users I'm
>>> talking
>> to need, at the moment, so it's plausible others would also find some
>> benefit in having the Pyarrow API expose low-level Parquet
>> encryption.
> Then
>> again, it might only be this one company and no one else cares.
>>
>> The arguments against, per Gidon Gershinsky:
>>
>>>  * security: low-level encryption API is easy to misuse (eg giving
>> the
>> same keys for a number of different files; this'd break the AES GCM
>> cipher). The high-level encryption layer handles that by applying
> envelope
>> encryption and other best practices in data security. Also, this
>> layer
>>> is
>> maintained by the community, meaning that future improvements and
> security
>> fixes can be upstreamed by anyone, and available to all.
>>>  * compatibility: parquet-mr implements the high-level encryption
> layer.
>> If we want the files produced by Spark/Presto/etc to be readable by
>> pandas/PyArrow (and vice versa), we need to provide the Arrow users
>>> with
>> the high-level API.
>>> ...
>>>
>>> The current situation is not ideal, it'd be good to merge the
> high-level
>> PR (and maybe hide the low level), but here we are; also, C++ is a
>> kind
> of
>> a low-level language; Python would expose it to a less experienced
> audience.
>>
>> (Source: https://issues.apache.org/jira/browse/ARROW-8040)
>>
>> I find the compatibility argument less compelling, that's readily
>> addressed by documentation. I am not a crypto expert so I can't
>>> evaluate
>> how risky exposing the low-level encryption APIs would be, but I can
>>> see

Re: Threading Improvements Proposal

2021-02-17 Thread Antoine Pitrou


On 17/02/2021 at 05:20, Micah Kornfield wrote:
>>
>> If a method could potentially run some kind of long term blocking I/O
>> wait then yes.  So reading / writing tables & datasets, IPC,
>> filesystem APIs, etc. will all need to adapt.  It doesn't have to be
>> all at once.  CPU only functions would remain as they are.  So table
>> manipulation, compute functions, etc. would remain as they are.  For
>> example, there would never be any advantage to creating an
>> asynchronous method to drop a column from a table.
> 
> 
> My main concern is around the "viralness" of Futures. I think they are good
> in some cases but can become hard to reason about/error prone if you aren't
> used to working with them day in/out.  I don't have any concrete
> recommendation at this point, just something we should be careful about
> when doing the refactoring.

I think it's ok to expose synchronous facades at key points in the Arrow
API, to avoid having to deal with futures when you don't need to.

Regards

Antoine.
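
As a simplified picture of such a facade, a Rust sketch (hypothetical names;
a channel stands in for a real future type): the asynchronous entry point
returns a waitable handle, and the synchronous wrapper just blocks on it, so
callers who don't want futures never see one.

use std::sync::mpsc;
use std::thread;

// Minimal stand-in for a future: a handle you can block on.
struct Future<T>(mpsc::Receiver<T>);

impl<T> Future<T> {
    fn wait(self) -> T {
        self.0.recv().unwrap() // block until the producer sends the result
    }
}

// The async-style entry point: kicks off the work on another thread.
fn read_table_async(path: String) -> Future<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // stand-in for long-running blocking I/O
        tx.send(format!("table read from {path}")).unwrap();
    });
    Future(rx)
}

// The synchronous facade over the async API.
fn read_table(path: &str) -> String {
    read_table_async(path.to_string()).wait()
}

fn main() {
    println!("{}", read_table("/tmp/example.parquet"));
}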


[NIGHTLY] Arrow Build Report for Job nightly-2021-02-17-0

2021-02-17 Thread Crossbow


Arrow Build Report for Job nightly-2021-02-17-0

All tasks: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0

Failed Tasks:
- conda-linux-gcc-py36-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-drone-conda-linux-gcc-py36-aarch64
- conda-linux-gcc-py38-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-drone-conda-linux-gcc-py38-aarch64
- gandiva-jar-ubuntu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-gandiva-jar-ubuntu
- test-conda-cpp-valgrind:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-conda-cpp-valgrind
- test-conda-python-3.7-dask-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-conda-python-3.7-dask-latest
- test-conda-python-3.7-hdfs-3.2:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-conda-python-3.7-hdfs-3.2
- test-conda-python-3.7-turbodbc-latest:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-conda-python-3.7-turbodbc-latest
- test-conda-python-3.7-turbodbc-master:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-conda-python-3.7-turbodbc-master
- test-conda-python-3.8-jpype:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-conda-python-3.8-jpype
- test-r-versions:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-test-r-versions
- test-ubuntu-18.04-docs:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-test-ubuntu-18.04-docs
- wheel-osx-high-sierra-cp36m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-high-sierra-cp36m
- wheel-osx-high-sierra-cp37m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-high-sierra-cp37m
- wheel-osx-high-sierra-cp38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-high-sierra-cp38
- wheel-osx-high-sierra-cp39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-high-sierra-cp39
- wheel-osx-mavericks-cp36m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-mavericks-cp36m
- wheel-osx-mavericks-cp37m:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-mavericks-cp37m
- wheel-osx-mavericks-cp38:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-mavericks-cp38
- wheel-osx-mavericks-cp39:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-wheel-osx-mavericks-cp39

Succeeded Tasks:
- centos-7-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-centos-7-amd64
- centos-8-amd64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-github-centos-8-amd64
- conda-clean:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-clean
- conda-linux-gcc-py36-cpu-r36:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py36-cpu-r36
- conda-linux-gcc-py36-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py36-cuda
- conda-linux-gcc-py37-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-drone-conda-linux-gcc-py37-aarch64
- conda-linux-gcc-py37-cpu-r40:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py37-cpu-r40
- conda-linux-gcc-py37-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py37-cuda
- conda-linux-gcc-py38-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py38-cpu
- conda-linux-gcc-py38-cuda:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py38-cuda
- conda-linux-gcc-py39-aarch64:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-drone-conda-linux-gcc-py39-aarch64
- conda-linux-gcc-py39-cpu:
  URL: 
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-02-17-0-azure-conda-linux-gcc-py39-cpu
- conda-linux-gcc-py39-cuda:
  URL: