Re: What's the proper procedure to publish a docker image to dockerhub?

2019-04-15 Thread Wes McKinney
It is not in compliance with Apache policy to publish any official Apache
artifacts unless the PMC has held a release vote on them (or on the source
artifact that produces them). You are free, of course, to publish a Docker
image under a non-Apache Docker Hub account.

Wes

On Tue, Apr 16, 2019, 10:10 AM Micah Kornfield 
wrote:

> I'm not sure of the policy here, but I think if this is something official
> then the PMC would have to set it up and control it.  Could someone on the
> PMC chime in?
>
> On Monday, April 15, 2019, Zhiyuan Zheng  wrote:
>
> > Thanks Alberto!
> >
> > If we are able to create an official repository solely for Apache Arrow,
> > it would be more flexible to publish new images in the future.
> >
> > How do we create such a repository?
> >
> > 16.04.2019, 01:27, "Alberto Ramón" :
> > > Hello Zhiyuan
> > >
> > > I can help you if you need help with this process.
> > > The best option is to request an official repository for the Apache
> > > Arrow Project (these are the ones that start with '_'; the Redis
> > > repository is an example).
> > >
> > > On Mon, 15 Apr 2019 at 15:21, Zhiyuan Zheng 
> > > wrote:
> > >
> > >>  Hi,
> > >>
> > >>  DataFusion is an in-memory query engine that uses Apache Arrow as
> > >>  the memory model.
> > >>
> > >>  I have created a Dockerfile for DataFusion
> > >>  (https://issues.apache.org/jira/browse/ARROW-4467).
> > >>
> > >>  To help users start using DataFusion for some simple real-world use
> > >>  cases, I would like to publish a Docker image tagged
> > >>  'apache/arrow-datafusion' to Docker Hub.
> > >>
> > >>  What's the procedure for publishing a Docker image to Docker Hub
> > >>  under the 'apache' prefix?
> > >>
> > >>  Cheers,
> > >>  Zhiyuan
> >
>


Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large bytes)

2019-04-15 Thread Micah Kornfield
To summarize my understanding of the thread so far, there seems to be
consensus on having a new distinct type for each "large" type.

There are some reservations around the "large" types being harder to
support in algorithmic implementations.

I'm curious Philipp, was there a concrete use-case that inspired you to
start the PR?

Also, this was brought up on another thread, but utility of the "large"
types might be limited in some languages (e.g. Java) until they support
buffer sizes larger than INT_MAX bytes.  I brought this up on the current
PR to decouple Netty and memory management from ArrowBuf [1], but the
consensus seems to be to handle any modifications in follow-up PRs (if they
are agreed upon).

Anything else people want to discuss before a vote on whether to allow the
additional types into the spec?

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4151
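
To make the proposal concrete, here is a minimal sketch in plain C++ (not the
Arrow API; the struct and function names are invented for illustration) of
what a "large" binary column changes relative to the existing one: only the
width of the offsets buffer.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Existing variable-width layout: 32-bit offsets cap the values buffer at ~2 GiB.
struct BinaryColumn {
  std::vector<int32_t> offsets;  // n + 1 entries; offsets.back() == data.size()
  std::vector<uint8_t> data;     // concatenated value bytes
};

// Proposed "large" layout: identical semantics, but 64-bit offsets.
struct LargeBinaryColumn {
  std::vector<int64_t> offsets;
  std::vector<uint8_t> data;
};

// Appending differs only in the integer type written into the offsets buffer.
void Append(LargeBinaryColumn* col, const std::string& value) {
  if (col->offsets.empty()) col->offsets.push_back(0);
  col->data.insert(col->data.end(), value.begin(), value.end());
  col->offsets.push_back(static_cast<int64_t>(col->data.size()));
}

int main() {
  LargeBinaryColumn col;
  Append(&col, "hello");
  Append(&col, "arrow");
  std::cout << "values: " << col.offsets.size() - 1
            << ", bytes: " << col.data.size() << "\n";  // values: 2, bytes: 10
  return 0;
}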




On Monday, April 15, 2019, Jacques Nadeau  wrote:

> I am not Jacques, but I will try to give my own point of view on this.
> >
>
> Thanks for making me laugh :)
>
> I think that this is unavoidable. Even with batches, taking an example of a
> > binary column where the mean size of the payload is 1mb, it limits to
> > batches of 2048 elements. This can become annoying pretty quickly.
> >
>
> Good example. I'm not sure columnar matters but I find it more useful than
> others.
>
> logical types and physical types
> >
>
> TL;DR: it is painful no matter which model you pick.
>
> I definitely think we worked hard to take Arrow in a different direction than
> Parquet. It was something I pushed for consciously when we started, as I found
> some of the patterns in Parquet to be quite challenging. Unfortunately, we went
> too far in some places in the Java code, which tried to parallel the structure
> of the physical types directly (and thus the big refactor we did to reduce
> duplication last year -- props to Sidd, Bryan and the others who worked on
> that). I also think we probably lost as much as we gained with the current
> model.
>
> I agree with Antoine both in his clean statement of the approaches and that
> sticking to the model we have today makes the most sense.
>
> On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > Thanks for the clarification Antoine, very insightful.
> >
> > I'd also vote for keeping the existing model for consistency.
> >
> > On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou 
> wrote:
> >
> > >
> > > Hi,
> > >
> > > I am not Jacques, but I will try to give my own point of view on this.
> > >
> > > The distinction between logical and physical types can be modelled in
> > > two different ways:
> > >
> > > 1) a physical type can denote several logical types, but a logical type
> > > can only have a single physical representation.  This is currently the
> > > Arrow model.
> > >
> > > 2) a physical type can denote several logical types, and a logical type
> > > can also be denoted by several physical types.  This is the Parquet
> > model.
> > >
> > > (theoretically, there are two other possible models, but they are not
> > > very interesting to consider, since they don't seem to cater to
> > > concrete use cases)
> > >
> > > Model 1 is obviously more restrictive, while model 2 is more flexible.
> > > Model 2 could be said "higher level"; you see something similar if you
> > > compare Python's and C++'s typing systems.  On the other hand, model 1
> > > provides a potentially simpler programming model for implementors of
> > > low-level kernels, as you can simply query the logical type of your
> > > data and you automatically know its physical type.
> > >
> > > The model chosen for Arrow is ingrained in its API.  If we want to
> > > change the model we'd better do it wholesale (implying probably a large
> > > refactoring and a significant number of unavoidable regressions) to
> > > avoid subjecting users to a confusing middle point.
> > >
> > > Also and as a sidenote, "convertibility" between different types can be
> > > a hairy subject... Having strict boundaries between types avoids being
> > > dragged into it too early.
> > >
> > >
> > > To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> > > should be a distinct logical type from List (resp. Binary), the same
> > > way Int64 is a distinct logical type from Int32.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > > > Hello,
> > > >
> > > > I would like to understand where we stand on logical types and
> > > > physical types. As I understand it, this proposal is for the physical
> > > > representation.
> > > >
> > > > In the context of an execution engine, the concept of logical types
> > > > becomes more important, as two physical representations might have the
> > > > same semantic values, e.g. LargeList and List where all values fit in
> > > > 32 bits. A more complex example would be an Integer array and a
> > > > dictionary array whose values are integers.

What's the proper procedure to publish a docker image to dockerhub?

2019-04-15 Thread Micah Kornfield
I'm not sure of the policy here, but I think if this is something official
then the PMC would have to set it up and control it.  Could someone on the
PMC chime in?

On Monday, April 15, 2019, Zhiyuan Zheng  wrote:

> Thanks Alberto!
>
> If we are able to create an official repository solely for Apache Arrow, it
> would be more flexible to publish new images in the future.
>
> How do we create such a repository?
>
> 16.04.2019, 01:27, "Alberto Ramón" :
> > Hello Zhiyuan
> >
> > I can help you if you need help with this process.
> > The best option is to request an official repository for the Apache Arrow
> > Project (these are the ones that start with '_'; the Redis repository is
> > an example).
> >
> > On Mon, 15 Apr 2019 at 15:21, Zhiyuan Zheng 
> > wrote:
> >
> >>  Hi,
> >>
> >>  DataFusion is an in-memory query engine that uses Apache Arrow as the
> >>  memory model.
> >>
> >>  I have created a Dockerfile for DataFusion
> >>  (https://issues.apache.org/jira/browse/ARROW-4467).
> >>
> >>  To help users start using DataFusion for some simple real-world use
> >>  cases, I would like to publish a Docker image tagged
> >>  'apache/arrow-datafusion' to Docker Hub.
> >>
> >>  What's the procedure for publishing a Docker image to Docker Hub under
> >>  the 'apache' prefix?
> >>
> >>  Cheers,
> >>  Zhiyuan
>


Re: What's the proper procedure to publish a docker image to dockerhub?

2019-04-15 Thread Zhiyuan Zheng
Thanks Alberto!

If we are able to create an official repository solely for Apache Arrow, it
would be more flexible to publish new images in the future.

How do we create such a repository?

16.04.2019, 01:27, "Alberto Ramón" :
> Hello Zhiyuan
>
> I can help you if you need help with this process.
> The best option is to request an official repository for the Apache Arrow
> Project (these are the ones that start with '_'; the Redis repository is an
> example).
>
> On Mon, 15 Apr 2019 at 15:21, Zhiyuan Zheng 
> wrote:
>
>>  Hi,
>>
>>  DataFusion is an in-memory query engine that uses Apache Arrow as the
>>  memory model.
>>
>>  I have created a Dockerfile for DataFusion
>>  (https://issues.apache.org/jira/browse/ARROW-4467).
>>
>>  To help users start using DataFusion for some simple real-world use
>>  cases, I would like to publish a Docker image tagged
>>  'apache/arrow-datafusion' to Docker Hub.
>>
>>  What's the procedure for publishing a Docker image to Docker Hub under
>>  the 'apache' prefix?
>>
>>  Cheers,
>>  Zhiyuan


[jira] [Created] (ARROW-5171) [C++] Use LESS instead of LOWER in compare enum option.

2019-04-15 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-5171:
-

 Summary: [C++] Use LESS instead of LOWER in compare enum option.
 Key: ARROW-5171
 URL: https://issues.apache.org/jira/browse/ARROW-5171
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques


See https://github.com/apache/arrow/pull/3963#discussion_r275596603
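
For context, the change is purely a naming one; a hedged sketch of the enum in
question (illustrative only, not the actual Arrow C++ header):

// Illustrative sketch, not the real arrow/compute header: the ticket proposes
// LESS / LESS_EQUAL in place of LOWER / LOWER_EQUAL for the comparison option.
enum class CompareOperator : int {
  EQUAL,
  NOT_EQUAL,
  GREATER,
  GREATER_EQUAL,
  LESS,        // previously LOWER
  LESS_EQUAL,  // previously LOWER_EQUAL
};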





Re: ARROW-3191: Making ArrowBuf work with arbitrary memory

2019-04-15 Thread Siddharth Teotia
I believe reader/writer indexes are typically used when we send buffers
over the wire -- so they may not be necessary for all users of ArrowBuf.  I am
okay with the idea of providing a simple wrapper around ArrowBuf to manage the
reader/writer indexes with a couple of APIs. Note that some APIs such as
writeInt() and writeLong() bump the writer index, unlike their
setInt()/setLong() counterparts. JsonFileReader uses some of these APIs.
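
To illustrate the wrapper idea being discussed, here is a rough C++ sketch
(hypothetical names, not the Java ArrowBuf API): the raw buffer keeps only
absolute set/get accessors, and the sequential cursors live in a separate
decorator, mirroring the writeLong() vs setLong() distinction.

#include <cstdint>
#include <cstring>
#include <vector>

// Raw buffer: absolute-position accessors only, no cursor state.
class RawBuffer {
 public:
  explicit RawBuffer(size_t size) : bytes_(size) {}
  void SetInt64(size_t offset, int64_t v) { std::memcpy(&bytes_[offset], &v, sizeof(v)); }
  int64_t GetInt64(size_t offset) const {
    int64_t v;
    std::memcpy(&v, &bytes_[offset], sizeof(v));
    return v;
  }
 private:
  std::vector<uint8_t> bytes_;
};

// Decorator: adds the reader/writer indexes on top, so only streaming callers
// (e.g. wire serialization) pay for and reason about cursor state.
class CursorBuffer {
 public:
  explicit CursorBuffer(RawBuffer* buf) : buf_(buf) {}
  // WriteInt64 bumps the writer index; SetInt64 on RawBuffer does not.
  void WriteInt64(int64_t v) {
    buf_->SetInt64(writer_index_, v);
    writer_index_ += sizeof(v);
  }
  int64_t ReadInt64() {
    int64_t v = buf_->GetInt64(reader_index_);
    reader_index_ += sizeof(v);
    return v;
  }
 private:
  RawBuffer* buf_;
  size_t writer_index_ = 0;
  size_t reader_index_ = 0;
};

int main() {
  RawBuffer raw(64);
  CursorBuffer cursor(&raw);
  cursor.WriteInt64(42);
  cursor.WriteInt64(7);
  return cursor.ReadInt64() == 42 ? 0 : 1;
}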



On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau  wrote:

> Hey Sidd,
>
> Thanks for pulling this together. This looks very promising. One quick
> thought: do we think the concept of the reader and writer index needs to be
> on ArrowBuf? It seems like something that could be added as an additional
> decoration/wrapper when needed, instead of being part of the core structure.
>
> On Sat, Apr 13, 2019 at 11:26 AM Siddharth Teotia 
> wrote:
>
> > Hi All,
> >
> > I have put up a PR with WIP changes. All of the major changes have been
> > made to decouple the usage of ArrowBuf from reference management. The
> > ArrowBuf interface is much simpler and cleaner now.
> >
> > I believe there would be several folks in the community interested in
> these
> > changes so please feel free to take a look at the PR and provide your
> > feedback -- https://github.com/apache/arrow/pull/4151
> >
> > There is some cleanup needed (the code doesn't compile yet) due to moving
> > the APIs, but I have raised the PR to get early feedback from the
> > community on the critical changes.
> >
> > Thanks,
> > Siddharth
> >
>


Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large bytes)

2019-04-15 Thread Jacques Nadeau
I am not Jacques, but I will try to give my own point of view on this.
>

Thanks for making me laugh :)

I think that this is unavoidable. Even with batches, taking an example of a
> binary column where the mean size of the payload is 1mb, it limits to
> batches of 2048 elements. This can become annoying pretty quickly.
>

Good example. I'm not sure columnar matters but I find it more useful than
others.

logical types and physical types
>

TL;DR: it is painful no matter which model you pick.

I definitely think we worked hard to take Arrow in a different direction than
Parquet. It was something I pushed for consciously when we started, as I found
some of the patterns in Parquet to be quite challenging. Unfortunately, we went
too far in some places in the Java code, which tried to parallel the structure
of the physical types directly (and thus the big refactor we did to reduce
duplication last year -- props to Sidd, Bryan and the others who worked on
that). I also think we probably lost as much as we gained with the current
model.

I agree with Antoine both in his clean statement of the approaches and that
sticking to the model we have today makes the most sense.

On Mon, Apr 15, 2019 at 11:05 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Thanks for the clarification Antoine, very insightful.
>
> I'd also vote for keeping the existing model for consistency.
>
> On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou  wrote:
>
> >
> > Hi,
> >
> > I am not Jacques, but I will try to give my own point of view on this.
> >
> > The distinction between logical and physical types can be modelled in
> > two different ways:
> >
> > 1) a physical type can denote several logical types, but a logical type
> > can only have a single physical representation.  This is currently the
> > Arrow model.
> >
> > 2) a physical type can denote several logical types, and a logical type
> > can also be denoted by several physical types.  This is the Parquet
> model.
> >
> > (theoretically, there are two other possible models, but they are not
> > very interesting to consider, since they don't seem to cater to concrete
> > use cases)
> >
> > Model 1 is obviously more restrictive, while model 2 is more flexible.
> > Model 2 could be said "higher level"; you see something similar if you
> > compare Python's and C++'s typing systems.  On the other hand, model 1
> > provides a potentially simpler programming model for implementors of
> > low-level kernels, as you can simply query the logical type of your data
> > and you automatically know its physical type.
> >
> > The model chosen for Arrow is ingrained in its API.  If we want to
> > change the model we'd better do it wholesale (implying probably a large
> > refactoring and a significant number of unavoidable regressions) to
> > avoid subjecting users to a confusing middle point.
> >
> > Also and as a sidenote, "convertibility" between different types can be
> > a hairy subject... Having strict boundaries between types avoids being
> > dragged into it too early.
> >
> >
> > To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> > should be a distinct logical type from List (resp. Binary), the same way
> > Int64 is a distinct logical type from Int32.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > > Hello,
> > >
> > > I would like to understand where we stand on logical types and physical
> > > types. As I understand it, this proposal is for the physical
> > > representation.
> > >
> > > In the context of an execution engine, the concept of logical types
> > > becomes more important, as two physical representations might have the
> > > same semantic values, e.g. LargeList and List where all values fit in
> > > 32 bits. A more complex example would be an Integer array and a
> > > dictionary array whose values are integers.
> > >
> > > Is this only relevant for execution engines? What about the (C++)
> > > Array.Equals method and related comparison methods? This also touches
> > > on the subject of type equality, e.g. dictionaries with different but
> > > compatible encodings.
> > >
> > > Jacques, knowing that you worked on Parquet (which follows this model)
> > > and Dremio, what is your opinion?
> > >
> > > François
> > >
> > > Some related tickets:
> > > - https://jira.apache.org/jira/browse/ARROW-554
> > > - https://jira.apache.org/jira/browse/ARROW-1741
> > > - https://jira.apache.org/jira/browse/ARROW-3144
> > > - https://jira.apache.org/jira/browse/ARROW-4097
> > > - https://jira.apache.org/jira/browse/ARROW-5052
> > >
> > >
> > >
> > > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield  >
> > > wrote:
> > >
> > >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> > >> offsets to Lists, Strings and binary data types.
> > >>
> > >> Philipp started an implementation for the large list type [3] and I
> > >> hacked together a potentially viable Java implementation [4]

Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large bytes)

2019-04-15 Thread Francois Saint-Jacques
Thanks for the clarification Antoine, very insightful.

I'd also vote for keeping the existing model for consistency.

On Mon, Apr 15, 2019 at 1:40 PM Antoine Pitrou  wrote:

>
> Hi,
>
> I am not Jacques, but I will try to give my own point of view on this.
>
> The distinction between logical and physical types can be modelled in
> two different ways:
>
> 1) a physical type can denote several logical types, but a logical type
> can only have a single physical representation.  This is currently the
> Arrow model.
>
> 2) a physical type can denote several logical types, and a logical type
> can also be denoted by several physical types.  This is the Parquet model.
>
> (theoretically, there are two other possible models, but they are not
> very interesting to consider, since they don't seem to cater to concrete
> use cases)
>
> Model 1 is obviously more restrictive, while model 2 is more flexible.
> Model 2 could be said "higher level"; you see something similar if you
> compare Python's and C++'s typing systems.  On the other hand, model 1
> provides a potentially simpler programming model for implementors of
> low-level kernels, as you can simply query the logical type of your data
> and you automatically know its physical type.
>
> The model chosen for Arrow is ingrained in its API.  If we want to
> change the model we'd better do it wholesale (implying probably a large
> refactoring and a significant number of unavoidable regressions) to
> avoid subjecting users to a confusing middle point.
>
> Also and as a sidenote, "convertibility" between different types can be
> a hairy subject... Having strict boundaries between types avoids being
> dragged into it too early.
>
>
> To return to the original subject: IMHO, LargeList (resp. LargeBinary)
> should be a distinct logical type from List (resp. Binary), the same way
> Int64 is a distinct logical type from Int32.
>
> Regards
>
> Antoine.
>
>
>
> Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> > Hello,
> >
> > I would like to understand where we stand on logical types and physical
> > types. As I understand it, this proposal is for the physical representation.
> >
> > In the context of an execution engine, the concept of logical types becomes
> > more important, as two physical representations might have the same semantic
> > values, e.g. LargeList and List where all values fit in 32 bits. A more
> > complex example would be an Integer array and a dictionary array whose
> > values are integers.
> >
> > Is this only relevant for execution engines? What about the (C++)
> > Array.Equals method and related comparison methods? This also touches on
> > the subject of type equality, e.g. dictionaries with different but
> > compatible encodings.
> >
> > Jacques, knowing that you worked on Parquet (which follows this model)
> > and Dremio, what is your opinion?
> >
> > François
> >
> > Some related tickets:
> > - https://jira.apache.org/jira/browse/ARROW-554
> > - https://jira.apache.org/jira/browse/ARROW-1741
> > - https://jira.apache.org/jira/browse/ARROW-3144
> > - https://jira.apache.org/jira/browse/ARROW-4097
> > - https://jira.apache.org/jira/browse/ARROW-5052
> >
> >
> >
> > On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield 
> > wrote:
> >
> >> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> >> offsets to Lists, Strings and binary data types.
> >>
> >> Philipp started an implementation for the large list type [3] and I
> >> hacked together a potentially viable Java implementation [4]
> >>
> >> I'd like to kick off the discussion for getting these types voted on.
> >> I'm coupling them together because I think there are design
> >> considerations for how we evolve Schema.fbs
> >>
> >> There are two proposed options:
> >> 1.  The current PR proposal, which adds a new type LargeList:
> >>   // List with 64-bit offsets
> >>   table LargeList {}
> >>
> >> 2.  As François suggested, it might be cleaner to parameterize List with
> >> the offset width.  I suppose something like:
> >>
> >> table List {
> >>   // only 32-bit and 64-bit offsets are supported
> >>   bitWidth: int = 32;
> >> }
> >>
> >> I think Option 2 is cleaner and potentially better long-term, but I
> >> think it breaks forward compatibility of the existing Arrow libraries.
> >> If we proceed with Option 2, I would advocate making the change to
> >> Schema.fbs all at once for all types (assuming we think that 64-bit
> >> offsets are desirable for all types), along with forward-compatibility
> >> checks, to avoid multiple releases where forward compatibility is broken
> >> (by broken I mean the inability to detect that an implementation is
> >> receiving data it can't read).  What are people's thoughts on this?
> >>
> >> Also, any other concerns with adding these types?
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://issues.apache.org/jira/browse/ARROW-4810
> >> [2] https://issues.apache.org/jira/browse/ARROW-750
> >> [3] https://github.com/apache/arrow/pull/3848

Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large bytes)

2019-04-15 Thread Antoine Pitrou


Hi,

I am not Jacques, but I will try to give my own point of view on this.

The distinction between logical and physical types can be modelled in
two different ways:

1) a physical type can denote several logical types, but a logical type
can only have a single physical representation.  This is currently the
Arrow model.

2) a physical type can denote several logical types, and a logical type
can also be denoted by several physical types.  This is the Parquet model.

(theoretically, there are two other possible models, but they are not
very interesting to consider, since they don't seem to cater to concrete
use cases)

Model 1 is obviously more restrictive, while model 2 is more flexible.
Model 2 could be said "higher level"; you see something similar if you
compare Python's and C++'s typing systems.  On the other hand, model 1
provides a potentially simpler programming model for implementors of
low-level kernels, as you can simply query the logical type of your data
and you automatically know its physical type.
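
As a small concrete example of model 1 using the Arrow C++ API (Date32 vs
Int32 chosen only for illustration): the two logical types are distinct, yet
each one fully determines its 32-bit physical layout.

#include <iostream>
#include <arrow/api.h>

// Sketch against the Arrow C++ API: Date32 and Int32 are distinct logical
// types, but each one implies exactly one physical layout (32-bit integers),
// which is what model 1 describes.
int main() {
  std::shared_ptr<arrow::DataType> date = arrow::date32();  // days since epoch
  std::shared_ptr<arrow::DataType> ints = arrow::int32();

  std::cout << date->ToString() << " == " << ints->ToString() << " ? "
            << date->Equals(ints) << "\n";  // 0: logically different types

  // Both expose the same fixed physical width through the common base class.
  const auto* fw_date = static_cast<const arrow::FixedWidthType*>(date.get());
  const auto* fw_ints = static_cast<const arrow::FixedWidthType*>(ints.get());
  std::cout << fw_date->bit_width() << " vs " << fw_ints->bit_width()
            << "\n";  // 32 vs 32: same physical representation
  return 0;
}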

The model chosen for Arrow is ingrained in its API.  If we want to
change the model we'd better do it wholesale (implying probably a large
refactoring and a significant number of unavoidable regressions) to
avoid subjecting users to a confusing middle point.

Also and as a sidenote, "convertibility" between different types can be
a hairy subject... Having strict boundaries between types avoids being
dragged into it too early.


To return to the original subject: IMHO, LargeList (resp. LargeBinary)
should be a distinct logical type from List (resp. Binary), the same way
Int64 is a distinct logical type from Int32.

Regards

Antoine.



Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> Hello,
> 
> I would like to understand where we stand on logical types and physical
> types. As I understand it, this proposal is for the physical representation.
> 
> In the context of an execution engine, the concept of logical types becomes
> more important, as two physical representations might have the same semantic
> values, e.g. LargeList and List where all values fit in 32 bits. A more
> complex example would be an Integer array and a dictionary array whose
> values are integers.
> 
> Is this only relevant for execution engines? What about the (C++)
> Array.Equals method and related comparison methods? This also touches on
> the subject of type equality, e.g. dictionaries with different but
> compatible encodings.
> 
> Jacques, knowing that you worked on Parquet (which follows this model)
> and Dremio, what is your opinion?
> 
> François
> 
> Some related tickets:
> - https://jira.apache.org/jira/browse/ARROW-554
> - https://jira.apache.org/jira/browse/ARROW-1741
> - https://jira.apache.org/jira/browse/ARROW-3144
> - https://jira.apache.org/jira/browse/ARROW-4097
> - https://jira.apache.org/jira/browse/ARROW-5052
> 
> 
> 
> On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield 
> wrote:
> 
>> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets
>> to Lists, Strings and binary data types.
>>
>> Philipp started an implementation for the large list type [3] and I hacked
>> together a potentially viable Java implementation [4]
>>
>> I'd like to kick off the discussion for getting these types voted on.  I'm
>> coupling them together because I think there are design considerations for
>> how we evolve Schema.fbs
>>
>> There are two proposed options:
>> 1.  The current PR proposal, which adds a new type LargeList:
>>   // List with 64-bit offsets
>>   table LargeList {}
>>
>> 2.  As François suggested, it might be cleaner to parameterize List with
>> the offset width.  I suppose something like:
>>
>> table List {
>>   // only 32-bit and 64-bit offsets are supported
>>   bitWidth: int = 32;
>> }
>>
>> I think Option 2 is cleaner and potentially better long-term, but I think
>> it breaks forward compatibility of the existing Arrow libraries.  If we
>> proceed with Option 2, I would advocate making the change to Schema.fbs all
>> at once for all types (assuming we think that 64-bit offsets are desirable
>> for all types), along with forward-compatibility checks, to avoid multiple
>> releases where forward compatibility is broken (by broken I mean the
>> inability to detect that an implementation is receiving data it can't
>> read).  What are people's thoughts on this?
>>
>> Also, any other concerns with adding these types?
>>
>> Thanks,
>> Micah
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-4810
>> [2] https://issues.apache.org/jira/browse/ARROW-750
>> [3] https://github.com/apache/arrow/pull/3848
>> [4]
>>
>> https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
>>
> 


Re: What's the proper procedure to publish a docker image to dockerhub?

2019-04-15 Thread Alberto Ramón
Hello Zhiyuan

I can help you if you need help with this process.
The best option is to request an official repository for the Apache Arrow
Project (these are the ones that start with '_'; the Redis repository is an
example).


On Mon, 15 Apr 2019 at 15:21, Zhiyuan Zheng 
wrote:

> Hi,
>
> DataFusion is an in-memory query engine that uses Apache Arrow as the
> memory model.
>
> I have created a Dockerfile for DataFusion
> (https://issues.apache.org/jira/browse/ARROW-4467).
>
> To help users start using DataFusion for some simple real-world use cases,
> I would like to publish a Docker image tagged 'apache/arrow-datafusion' to
> Docker Hub.
>
> What's the procedure for publishing a Docker image to Docker Hub under the
> 'apache' prefix?
>
> Cheers,
> Zhiyuan
>


Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large bytes)

2019-04-15 Thread Francois Saint-Jacques
Hello,

I would like to understand where we stand on logical types and physical
types. As I understand it, this proposal is for the physical representation.

In the context of an execution engine, the concept of logical types becomes
more important, as two physical representations might have the same semantic
values, e.g. LargeList and List where all values fit in 32 bits. A more
complex example would be an Integer array and a dictionary array whose
values are integers.

Is this only relevant for execution engines? What about the (C++)
Array.Equals method and related comparison methods? This also touches on
the subject of type equality, e.g. dictionaries with different but
compatible encodings.
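
To make the Array.Equals question concrete, a small example against the Arrow
C++ API (Int32 vs Int64 stands in here for List vs LargeList or dictionary vs
plain): arrays holding the same semantic values but carrying different types
do not compare equal.

#include <iostream>
#include <memory>
#include <arrow/api.h>

// Sketch: Equals compares types as well as values, so "semantically equal"
// arrays of different width are distinct.
arrow::Status RunExample() {
  std::shared_ptr<arrow::Array> narrow, wide;

  arrow::Int32Builder b32;
  ARROW_RETURN_NOT_OK(b32.AppendValues({1, 2, 3}));
  ARROW_RETURN_NOT_OK(b32.Finish(&narrow));

  arrow::Int64Builder b64;
  ARROW_RETURN_NOT_OK(b64.AppendValues({1, 2, 3}));
  ARROW_RETURN_NOT_OK(b64.Finish(&wide));

  // Prints 0: the same issue a List vs LargeList comparison would raise.
  std::cout << "Equals: " << narrow->Equals(wide) << std::endl;
  return arrow::Status::OK();
}

int main() { return RunExample().ok() ? 0 : 1; }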

Jacques, knowing that you worked on Parquet (which follows this model) and
Dremio, what is your opinion?

François

Some related tickets:
- https://jira.apache.org/jira/browse/ARROW-554
- https://jira.apache.org/jira/browse/ARROW-1741
- https://jira.apache.org/jira/browse/ARROW-3144
- https://jira.apache.org/jira/browse/ARROW-4097
- https://jira.apache.org/jira/browse/ARROW-5052



On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield 
wrote:

> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets
> to Lists, Strings and binary data types.
>
> Philipp started an implementation for the large list type [3] and I hacked
> together a potentially viable Java implementation [4]
>
> I'd like to kick off the discussion for getting these types voted on.  I'm
> coupling them together because I think there are design considerations for
> how we evolve Schema.fbs
>
> There are two proposed options:
> 1.  The current PR proposal, which adds a new type LargeList:
>   // List with 64-bit offsets
>   table LargeList {}
>
> 2.  As François suggested, it might be cleaner to parameterize List with
> the offset width.  I suppose something like:
>
> table List {
>   // only 32-bit and 64-bit offsets are supported
>   bitWidth: int = 32;
> }
>
> I think Option 2 is cleaner and potentially better long-term, but I think
> it breaks forward compatibility of the existing Arrow libraries.  If we
> proceed with Option 2, I would advocate making the change to Schema.fbs all
> at once for all types (assuming we think that 64-bit offsets are desirable
> for all types), along with forward-compatibility checks, to avoid multiple
> releases where forward compatibility is broken (by broken I mean the
> inability to detect that an implementation is receiving data it can't
> read).  What are people's thoughts on this?
>
> Also, any other concerns with adding these types?
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-4810
> [2] https://issues.apache.org/jira/browse/ARROW-750
> [3] https://github.com/apache/arrow/pull/3848
> [4]
>
> https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
>


Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large bytes)

2019-04-15 Thread Francois Saint-Jacques
I think that this is unavoidable. Even with batches, take the example of a
binary column where the mean payload size is 1 MB: it limits batches to about
2048 elements. This can become annoying pretty quickly.

François
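
As a quick sanity check of that arithmetic (a standalone sketch, not Arrow
code): with 32-bit offsets a single binary array's values buffer is capped at
INT32_MAX bytes, so 1 MiB values bound the batch size at roughly 2,000 rows.

#include <cstdint>
#include <iostream>

int main() {
  // 32-bit offsets cap the values buffer of one binary array at INT32_MAX bytes.
  constexpr int64_t kOffsetLimitBytes = INT32_MAX;   // ~2 GiB
  constexpr int64_t kMeanValueBytes = 1 << 20;       // 1 MiB per value
  std::cout << kOffsetLimitBytes / kMeanValueBytes
            << " rows per batch at most\n";          // 2047
  return 0;
}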

On Fri, Apr 12, 2019 at 11:15 PM Wes McKinney  wrote:

> Hi Jacques,
>
> I think there are different use cases. What we are seeing now in many
> places is the desire to use the Arrow format to represent very large
> on-disk datasets (e.g. memory-mapped). If we don't support this use case it
> will continue to cause tension in adoption and result in some applications
> choosing not to use Arrow, with subsequent ecosystem fragmentation. The
> idea of breaking work into smaller batches seems like an analytics /
> database-centric viewpoint, and the project has already grown significantly
> beyond that (e.g. considering uses for Plasma in HPC / machine learning
> applications).
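
For concreteness, a sketch of the memory-mapped use case described above,
written against what I believe is the current Arrow C++ API (the Result-based
signatures postdate this thread, so treat the exact calls as an assumption
rather than a reference):

#include <string>
#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Open an on-disk Arrow IPC file via mmap; record batches then reference the
// mapped pages directly instead of being copied into heap memory.
arrow::Status ScanMappedFile(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::MemoryMappedFile::Open(
                                       path, arrow::io::FileMode::READ));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    // ... process batch; columns larger than RAM stay paged on disk ...
  }
  return arrow::Status::OK();
}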
>
> There is a stronger argument IMHO for LargeBinary compared with LargeList
> because opaque objects may routinely exceed 2GB. I have just developed an
> ExtensionType facility in C++ (which I hope to see implemented soon in Java
> and elsewhere) to make this more useful / natural at an API level (where
> objects can be automatically unboxed into user types).
>
> Personally I would rather address the 64-bit offset issue now so that I
> stop hearing the objection from would-be users (I can count a dozen or so
> occasions where I've been accosted in person over this issue at conferences
> and elsewhere). It would be a good idea to recommend a preference for
> 32-bit offsets in our documentation.
>
> Wes
>
>
> On Fri, Apr 12, 2019, 10:35 PM Jacques Nadeau  wrote:
>
> > Definitely prefer option 1.
> >
> > I'm a -0.5 on the change in general. I think that early on users may want
> > to pattern things this way but as you start trying to parallelize work,
> > pipeline work, etc, moving beyond moderate batch sizes is ultimately a
> > different use case and won't be supported well within code that is
> > expecting to work with smaller data structures. A good example might be
> > doing a join between two datasets that will not fit in memory. Trying to
> > solve that with individual cells and records of the size proposed here is
> > probably an extreme edge case and thus won't be handled in the algorithms
> > people will implement. So you'll effectively get into this situation of
> > second-class datatypes that really aren't supported by most things.
> >
> > On Thu, Apr 11, 2019 at 2:06 PM Philipp Moritz 
> wrote:
> >
> > > Thanks for getting the discussion started, Micah!
> > >
> > > I'm +1 on this change and also slightly prefer 1. As Antoine mentions,
> > > there doesn't seem to be a clear benefit from 2, unless we want to also
> > > support 8 or 16 bit indices in the future, which seems unlikely. So
> going
> > > with 1 is ok I think.
> > >
> > > Best,
> > > Philipp.
> > >
> > > On Thu, Apr 11, 2019 at 7:06 AM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Le 11/04/2019 à 10:52, Micah Kornfield a écrit :
> > > > > ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit
> > > > > offsets to Lists, Strings and binary data types.
> > > > >
> > > > > Philipp started an implementation for the large list type [3] and I
> > > > > hacked together a potentially viable Java implementation [4]
> > > > >
> > > > > I'd like to kick off the discussion for getting these types voted
> > > > > on.  I'm coupling them together because I think there are design
> > > > > considerations for how we evolve Schema.fbs
> > > > >
> > > > > There are two proposed options:
> > > > > 1.  The current PR proposal, which adds a new type LargeList:
> > > > >   // List with 64-bit offsets
> > > > >   table LargeList {}
> > > > >
> > > > > 2.  As François suggested, it might be cleaner to parameterize List
> > > > > with the offset width.  I suppose something like:
> > > > >
> > > > > table List {
> > > > >   // only 32-bit and 64-bit offsets are supported
> > > > >   bitWidth: int = 32;
> > > > > }
> > > > >
> > > > > I think Option 2 is cleaner and potentially better long-term, but I
> > > > > think it breaks forward compatibility of the existing Arrow
> > > > > libraries.  If we proceed with Option 2, I would advocate making the
> > > > > change to Schema.fbs all at once for all types (assuming we think
> > > > > that 64-bit offsets are desirable for all types), along with
> > > > > forward-compatibility checks, to avoid multiple releases where
> > > > > forward compatibility is broken (by broken I mean the inability to
> > > > > detect that an implementation is receiving data it can't read).
> > > > > What are people's thoughts on this?
> > > >
> > > > I think Option 1 is ok.  Making List / String / Binary parameterizable
> > > > doesn't bring anything *concretely*, since the types will not be
> > > > physically interchangeable.  The cost of breaking compatibility should
> > > > be offset by a 

What's the proper procedure to publish a docker image to dockerhub?

2019-04-15 Thread Zhiyuan Zheng
Hi,

DataFusion is an in-memory query engine that uses Apache Arrow as the memory
model.

I have created a Dockerfile for DataFusion
(https://issues.apache.org/jira/browse/ARROW-4467).

To help users start using DataFusion for some simple real-world use cases, I
would like to publish a Docker image tagged 'apache/arrow-datafusion' to
Docker Hub.

What's the procedure for publishing a Docker image to Docker Hub under the
'apache' prefix?

Cheers,
Zhiyuan


[jira] [Created] (ARROW-5170) [Rust][Datafusion] Add datafusion-cli to the docker-compose setup

2019-04-15 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5170:
--

 Summary: [Rust][Datafusion] Add datafusion-cli to the 
docker-compose setup
 Key: ARROW-5170
 URL: https://issues.apache.org/jira/browse/ARROW-5170
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust - DataFusion
Reporter: Krisztian Szucs


See dockerfile: 
https://github.com/apache/arrow/pull/4147/files#diff-2233b99f64d6acbef4e9d964e29fa76bR18





Re: [Python/C++] Help running HDFS integration test for C++/Python

2019-04-15 Thread Krisztián Szűcs
Or: `make -f Makefile.docker run-hdfs-integration`.
dev/container/README.md is definitely outdated.


On Mon, Apr 15, 2019 at 9:09 AM Krisztián Szűcs 
wrote:

> Hey Micah,
>
> Try the following [1]:
>
> export PYTHON_VERSION=3.6
> docker-compose build cpp
> docker-compose build python
> docker-compose build hdfs-integration
> docker-compose run hdfs-integration
>
> [1]: https://github.com/apache/arrow/blob/master/docker-compose.yml#L444
>
> On Mon, Apr 15, 2019 at 8:38 AM Micah Kornfield 
> wrote:
>
>> I'm trying to verify a PR [1] works with the HDFS integration test, but I
>> am having trouble running docker compose.
>>
>> I am trying to follow the README.md in dev [2] but I am running into an
>> error "pull access denied for arrow".  The full command and error are
>> pasted below. (I think there is also a typo in the docs "_" instead of
>> "-").  I'm new to docker and docker-compose in general, so please excuse
>> me
>> if this is something obvious.
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> Micah
>>
>> Environment details:
>> Ubuntu 18.
>> Docker version 18.06.1-ce, build e68fc7a
>>
>> docker-compose version 1.17.1, build 6d101fb
>>
>> Full run:
>>
>> ./run_docker_compose.sh hdfs-integration
>>
>> Building hdfs-integration
>>
>> Step 1/7 : FROM arrow:python-3.6
>>
>> ERROR: Service 'hdfs-integration' failed to build: pull access denied for
>> arrow, repository does not exist or may require 'docker login'
>>
>>
>>
>> [1] https://github.com/apache/arrow/pull/4081
>> [2] https://github.com/apache/arrow/blob/master/dev/README.md
>>
>


Re: [Python/C++] Help running HDFS integration test for C++/Python

2019-04-15 Thread Krisztián Szűcs
Hey Micah,

Try the following [1]:

export PYTHON_VERSION=3.6
docker-compose build cpp
docker-compose build python
docker-compose build hdfs-integration
docker-compose run hdfs-integration

[1]: https://github.com/apache/arrow/blob/master/docker-compose.yml#L444

On Mon, Apr 15, 2019 at 8:38 AM Micah Kornfield 
wrote:

> I'm trying to verify a PR [1] works with the HDFS integration test, but I
> am having trouble running docker compose.
>
> I am trying to follow the README.md in dev [2] but I am running into an
> error "pull access denied for arrow".  The full command and error are
> pasted below. (I think there is also a typo in the docs "_" instead of
> "-").  I'm new to docker and docker-compose in general, so please excuse me
> if this is something obvious.
>
> Any help would be appreciated.
>
> Thanks,
> Micah
>
> Environment details:
> Ubuntu 18.
> Docker version 18.06.1-ce, build e68fc7a
>
> docker-compose version 1.17.1, build 6d101fb
>
> Full run:
>
> ./run_docker_compose.sh hdfs-integration
>
> Building hdfs-integration
>
> Step 1/7 : FROM arrow:python-3.6
>
> ERROR: Service 'hdfs-integration' failed to build: pull access denied for
> arrow, repository does not exist or may require 'docker login'
>
>
>
> [1] https://github.com/apache/arrow/pull/4081
> [2] https://github.com/apache/arrow/blob/master/dev/README.md
>


Re: [Python/C++] Help running HDFS integration test for C++/Python

2019-04-15 Thread Micah Kornfield
Thanks to a hint from Wes, I found
https://github.com/apache/arrow/tree/master/dev/container/README.md, which I
think I need to run through first before running the steps in the dev
README. I'll try this and update the documentation if it seems to work.

On Sun, Apr 14, 2019 at 11:37 PM Micah Kornfield 
wrote:

> I'm trying to verify a PR [1] works with the HDFS integration test, but I
> am having trouble running docker compose.
>
> I am trying to follow the README.md in dev [2] but I am running into an
> error "pull access denied for arrow".  The full command and error are
> pasted below. (I think there is also a typo in the docs "_" instead of
> "-").  I'm new to docker and docker-compose in general, so please excuse me
> if this is something obvious.
>
> Any help would be appreciated.
>
> Thanks,
> Micah
>
> Environment details:
> Ubuntu 18.
> Docker version 18.06.1-ce, build e68fc7a
>
> docker-compose version 1.17.1, build 6d101fb
>
> Full run:
>
> ./run_docker_compose.sh hdfs-integration
>
> Building hdfs-integration
>
> Step 1/7 : FROM arrow:python-3.6
>
> ERROR: Service 'hdfs-integration' failed to build: pull access denied for
> arrow, repository does not exist or may require 'docker login'
>
>
>
> [1] https://github.com/apache/arrow/pull/4081
> [2] https://github.com/apache/arrow/blob/master/dev/README.md
>


[Python/C++] Help running HDFS integration test for C++/Python

2019-04-15 Thread Micah Kornfield
I'm trying to verify that a PR [1] works with the HDFS integration test, but I
am having trouble running docker-compose.

I am trying to follow the README.md in dev [2], but I am running into the
error "pull access denied for arrow".  The full command and error are
pasted below. (I think there is also a typo in the docs: "_" instead of
"-".)  I'm new to docker and docker-compose in general, so please excuse me
if this is something obvious.

Any help would be appreciated.

Thanks,
Micah

Environment details:
Ubuntu 18.
Docker version 18.06.1-ce, build e68fc7a

docker-compose version 1.17.1, build 6d101fb

Full run:

./run_docker_compose.sh hdfs-integration

Building hdfs-integration

Step 1/7 : FROM arrow:python-3.6

ERROR: Service 'hdfs-integration' failed to build: pull access denied for
arrow, repository does not exist or may require 'docker login'



[1] https://github.com/apache/arrow/pull/4081
[2] https://github.com/apache/arrow/blob/master/dev/README.md